onsite
Software Engineer, Data Infrastructure & Acquisition - Cambridge, United Kingdom - Speechify
Software Engineer
About the role
Lead the design and implementation of scalable data pipelines and infrastructure to support Speechify’s text‑to‑speech services, leveraging Python, AWS, and Spark to ingest, transform, and store large volumes of content data.
Key Responsibilities
- Architect, develop, and maintain robust data pipelines that ingest content from diverse sources (PDFs, web pages, documents) into a unified data lake.
- Implement ETL workflows using Python, Spark, and SQL to clean, enrich, and transform raw data for downstream analytics and model training.
- Collaborate with data scientists and product teams to define data requirements and ensure high‑quality, reproducible datasets.
- Optimize pipeline performance and cost on AWS using services such as S3, Glue, EMR, and Lambda.
- Containerize services with Docker and orchestrate deployments using Kubernetes or ECS for high availability.
- Monitor pipeline health, troubleshoot failures, and continuously improve reliability and scalability.
Requirements
- 5+ years of experience building production‑grade data pipelines in a cloud environment.
- Strong proficiency in Python, SQL, and Apache Spark.
- Hands‑on experience with AWS data services (S3, Glue, EMR, Lambda).
- Knowledge of containerization (Docker) and orchestration (Kubernetes or ECS).
- Excellent problem‑solving skills and a passion for clean, maintainable code.
Skills
Python, AWS, SQL, Apache Spark, Docker