Onsite
Software Engineer, Data Infrastructure & Acquisition - Phoenix, AZ, USA - Speechify
Lead the design and implementation of scalable data pipelines and infrastructure to ingest, transform, and serve large volumes of content for a global text‑to‑speech platform, leveraging Python, AWS, and distributed processing tools.
About the role
Key Responsibilities
- Architect and build robust, scalable data pipelines that ingest raw content from diverse sources (PDFs, web pages, documents) into the data lake.
- Implement ETL workflows using Python, Spark, and SQL to clean, enrich, and transform data for downstream analytics and model training.
- Collaborate with data scientists and product teams to define data schemas, quality metrics, and performance benchmarks.
- Optimize pipeline performance and cost on AWS (S3, Glue, Redshift, EMR) while ensuring high availability and fault tolerance.
- Monitor, troubleshoot, and continuously improve data ingestion, processing, and storage solutions.
Requirements
- 5+ years of experience building production data pipelines in a cloud environment.
- Strong proficiency in Python, SQL, and distributed processing frameworks (Spark, Flink).
- Hands‑on experience with AWS services (S3, Glue, Redshift, EMR, Lambda).
- Solid understanding of data modeling, ETL best practices, and data quality principles.
- Excellent problem‑solving skills and a passion for building reliable, scalable data infrastructure.
Skills
Python, AWS, SQL, Apache Spark, Kafka