onsite
Software Engineer, Data Infrastructure & Acquisition - Boston, MA, USA - Speechify
About the role
Lead the design and scaling of Speechify’s data ingestion and processing pipelines, leveraging Python, AWS, Spark, and SQL to ensure reliable, high‑throughput data flows for our text‑to‑speech services.
Key Responsibilities
- Architect, build, and maintain scalable data pipelines that ingest, transform, and store large volumes of text and metadata from diverse sources (PDFs, web pages, documents).
- Implement robust ETL workflows using Python, Spark, and AWS services (S3, Glue, Redshift, Lambda) to support real‑time and batch processing.
- Collaborate with data scientists and product teams to define data models, optimize query performance, and ensure data quality across the platform.
- Monitor pipeline health, troubleshoot failures, and continuously improve reliability and latency.
- Document architecture, processes, and best practices for internal use and future onboarding.
Requirements
- 3+ years of experience building production data pipelines in a cloud environment.
- Strong proficiency in Python, SQL, and Spark for large‑scale data processing.
- Hands‑on experience with AWS services (S3, Glue, Redshift, Lambda, Kinesis).
- Solid understanding of data modeling, ETL best practices, and performance tuning.
- Excellent problem‑solving skills and a collaborative mindset.