onsite
Software Engineer, Data Infrastructure & Acquisition - Chennai, India - Speechify
About the role
Lead the design and implementation of scalable data pipelines and infrastructure to ingest, process, and store large volumes of content for a global text‑to‑speech platform, using Python, AWS, and distributed data tools.
Key Responsibilities
- Architect and build robust, fault‑tolerant data pipelines that ingest raw content from diverse sources (PDFs, web pages, documents) into the platform’s data lake.
- Develop and maintain ETL workflows using Python, Apache Spark, and AWS services (S3, Glue, Redshift, Athena) to transform and enrich data for downstream analytics and ML models.
- Implement real‑time streaming ingestion with Kafka, ensuring low latency and high throughput for content updates.
- Collaborate with data scientists and product teams to define data schemas, quality metrics, and performance benchmarks.
- Improve query performance and reduce storage costs through partitioning, indexing, and careful data lake layout.
- Monitor pipeline health, troubleshoot failures, and continuously improve reliability and scalability.
Requirements
- 5+ years of experience building production‑grade data pipelines in a cloud environment.
- Strong proficiency in Python, SQL, and distributed data processing frameworks (Spark, Flink).
- Hands‑on experience with AWS data services (S3, Glue, Redshift, Athena, Kinesis).
- Solid understanding of streaming architectures and Kafka.
- Excellent problem‑solving skills and a passion for clean, maintainable code.
Skills
Python, AWS, SQL, Apache Spark, Kafka