onsite
Software Engineer, Data Infrastructure & Acquisition - Atlanta, GA, USA - Speechify
Software Engineer
About the role
Lead the design and implementation of scalable data pipelines and infrastructure to ingest, transform, and serve large volumes of content for Speechify’s text‑to‑speech platform, leveraging Python, AWS, and modern streaming/ETL technologies.
Key Responsibilities
- Design, build, and maintain robust data pipelines that ingest raw content from diverse sources (PDFs, web pages, documents) into a unified data lake.
- Implement ETL workflows using Python, Apache Spark, and AWS services (S3, Glue, Redshift) to transform and enrich data for downstream analytics and model training.
- Develop real‑time streaming solutions with Kafka and Kinesis to support low‑latency content processing and feature extraction.
- Collaborate with data scientists and product teams to define schemas, data quality rules, and performance benchmarks.
- Monitor pipeline health, troubleshoot failures, and continuously optimize throughput and cost efficiency.
Requirements
- 5+ years of experience building production‑grade data pipelines in a cloud environment.
- Strong proficiency in Python, SQL, and Spark for large‑scale data processing.
- Hands‑on experience with AWS data services (S3, Glue, Redshift, Kinesis) and streaming platforms like Kafka.
- Solid understanding of data modeling, ETL best practices, and performance tuning.
- Excellent problem‑solving skills and a collaborative mindset in a distributed team setting.
Skills
Python, AWS, SQL, Apache Spark, Kafka