onsite
Software Engineer, Data Infrastructure & Acquisition - Oxford, United Kingdom - Speechify
Lead the design and implementation of scalable data pipelines that ingest, transform, and store large volumes of content for real‑time text‑to‑speech services, leveraging Python, AWS, and Spark to power Speechify’s global platform.
About the role
Key Responsibilities
- Architect and develop robust, high‑throughput data pipelines to ingest diverse content sources (PDFs, web pages, documents) into the data lake.
- Implement ETL workflows using Python, Spark, and SQL, ensuring data quality, lineage, and compliance with privacy standards.
- Collaborate with product and ML teams to expose clean, enriched datasets for downstream speech synthesis models.
- Optimize pipeline performance and cost on AWS (S3, Glue, EMR, Redshift), containerize services with Docker, and deploy them on Kubernetes.
- Monitor, troubleshoot, and continuously improve pipeline reliability using observability tools.
Requirements
- 5+ years of experience building production data pipelines in a cloud environment.
- Strong proficiency in Python, SQL, and Spark for large‑scale data processing.
- Hands‑on experience with AWS services (S3, Glue, EMR, Redshift, Lambda).
- Solid understanding of data modeling, ETL best practices, and data governance.
- Excellent problem‑solving skills and a passion for building scalable, maintainable systems.
Skills
Python, AWS, SQL, Apache Spark, Docker