onsite
Software Engineer, Data Infrastructure & Acquisition - Ann Arbor, MI, USA - Speechify
Software Engineer
Lead the design and implementation of scalable data pipelines that ingest, transform, and store large volumes of content for text‑to‑speech services, leveraging Python, AWS, and distributed processing frameworks to ensure high availability and performance.
About the role
Key Responsibilities
- Architect and build robust, fault‑tolerant data pipelines that ingest raw content from diverse sources (PDFs, web pages, documents) into a unified data lake.
- Implement ETL workflows using Python, Apache Spark, and AWS services (S3, Glue, Redshift) to transform and enrich data for downstream consumption.
- Integrate real‑time data streams from Kafka and Kinesis to support live content ingestion and analytics.
- Collaborate with cross‑functional teams to define data models, schema evolution policies, and metadata management practices.
- Monitor pipeline performance, troubleshoot issues, and continuously optimize for cost and latency.
Requirements
- 5+ years of experience building production‑grade data pipelines in a cloud environment.
- Strong proficiency in Python, SQL, and distributed processing frameworks (Spark, Flink).
- Hands‑on experience with AWS data services (S3, Glue, Redshift, Athena) and streaming platforms (Kafka, Kinesis).
- Solid understanding of data modeling, schema design, and data quality best practices.
- Excellent problem‑solving skills and a passion for building scalable, maintainable systems.
Skills
Python, AWS, SQL, Apache Spark, Kafka