On-site
Software Engineer, Data Infrastructure & Acquisition - Columbus, OH, USA - Speechify
About the role
Lead the design and implementation of scalable data pipelines that ingest, transform, and store large volumes of content for real‑time text‑to‑speech services, using Python, AWS, and distributed processing frameworks.
Key Responsibilities
- Architect and build robust, fault‑tolerant data pipelines to ingest PDFs, books, and web content at scale.
- Implement ETL workflows using Python, SQL, and Apache Spark on AWS services (S3, Glue, Redshift).
- Integrate streaming data sources with Kafka to support real‑time content processing.
- Collaborate with product and ML teams to expose clean, high‑quality datasets for downstream services.
- Monitor pipeline performance, troubleshoot issues, and continuously optimize throughput and cost.
Requirements
- 5+ years of experience building production data pipelines in a cloud environment.
- Strong proficiency in Python, SQL, and distributed data processing (Spark, Flink).
- Hands‑on experience with AWS data services (S3, Glue, Redshift, EMR).
- Solid understanding of Kafka or similar streaming platforms.
- Excellent problem‑solving skills and a passion for clean, maintainable code.
Skills
Python, AWS, SQL, Apache Spark, Kafka