onsite
Software Engineer, Data Infrastructure & Acquisition - New York, NY, USA - Speechify
Lead the design and implementation of scalable data pipelines and infrastructure to support Speechify’s text‑to‑speech services, leveraging Python, AWS, Spark, and SQL to ingest, transform, and store large volumes of content data.
About the role
Key Responsibilities
- Design, develop, and maintain robust data ingestion pipelines that process diverse content sources (PDFs, books, web pages, documents) into structured formats for downstream TTS services.
- Implement scalable ETL workflows in Spark and Python that deliver high throughput and low latency for real‑time data processing.
- Collaborate with data scientists and product teams to define data models, schemas, and metadata standards that support analytics and feature development.
- Optimize data storage and retrieval on AWS (S3, Redshift, Athena) to reduce costs and improve query performance.
- Monitor pipeline health, troubleshoot failures, and continuously improve reliability and observability using CloudWatch, Prometheus, and Grafana.
- Document architecture, processes, and best practices for internal knowledge sharing.
Requirements
- 5+ years of experience building production‑grade data pipelines in a cloud environment.
- Strong proficiency in Python, SQL, and Spark (PySpark).
- Hands‑on experience with AWS services (S3, Redshift, Glue, Athena, Lambda).
- Solid understanding of data modeling, ETL design patterns, and performance tuning.
- Excellent problem‑solving skills and a collaborative mindset.