onsite
Software Engineer, Data Infrastructure & Acquisition - Santa Clara, CA, USA - Speechify
Software Engineer
About the role
Lead the design and implementation of scalable data pipelines and infrastructure to ingest, process, and serve large volumes of content for Speechify’s text‑to‑speech platform, leveraging Python, AWS, and distributed data tools.
Key Responsibilities
- Architect, build, and maintain robust data ingestion pipelines that transform raw content from PDFs, books, and web sources into structured formats for downstream services.
- Collaborate with cross‑functional teams to define data models, schemas, and quality standards that ensure high reliability and performance.
- Optimize and scale batch and streaming workflows using Apache Spark, Kafka, and AWS services (S3, Glue, Redshift).
- Implement monitoring, alerting, and automated testing to guarantee pipeline uptime and data integrity.
- Drive continuous improvement by evaluating new technologies, refactoring legacy code, and sharing best practices across the engineering organization.
Requirements
- 5+ years of experience building production‑grade data pipelines in a cloud environment.
- Strong proficiency in Python, SQL, and distributed processing frameworks (Spark, Flink).
- Hands‑on experience with AWS data services (S3, Glue, Redshift, Athena) and streaming platforms (Kafka, Kinesis).
- Solid understanding of data modeling, ETL best practices, and performance tuning.
- Excellent problem‑solving skills, ability to work independently in a distributed team, and a passion for clean, maintainable code.
Skills
Python, AWS, SQL, Apache Spark, Kafka