onsite
Software Engineer, Data Infrastructure & Acquisition - Eugene, OR, USA - Speechify
About the role
Lead the design and implementation of scalable data pipelines and infrastructure to support Speechify’s text‑to‑speech services, leveraging Python, AWS, and Spark to ingest, transform, and store large volumes of content data.
Key Responsibilities
- Design, build, and maintain robust data pipelines that ingest content from diverse sources (PDFs, web pages, documents) into the data lake.
- Implement ETL processes using Python and Apache Spark to clean, enrich, and transform raw data for downstream analytics and model training.
- Collaborate with data scientists and product teams to define data schemas, quality metrics, and performance benchmarks.
- Optimize pipeline performance and cost on AWS (S3, Glue, Redshift, Athena) while ensuring high availability and fault tolerance.
- Monitor, troubleshoot, and continuously improve data workflows, implementing automated alerts and logging.
Requirements
- 3+ years of experience building production data pipelines in a cloud environment.
- Strong proficiency in Python, SQL, and Spark for large‑scale data processing.
- Hands‑on experience with AWS services (S3, Glue, Redshift, Athena, Lambda).
- Solid understanding of data modeling, ETL best practices, and data quality principles.
- Excellent problem‑solving skills and a collaborative mindset in a distributed team setting.
Skills
Python, AWS, SQL, Apache Spark