onsite
Software Engineer, Data Infrastructure & Acquisition - Savannah, GA, USA - Speechify
Lead the design and implementation of scalable data pipelines that ingest, transform, and store large volumes of content for real‑time text‑to‑speech services, leveraging Python, AWS, Spark, and Kafka to ensure high availability and performance.
About the role
Key Responsibilities
- Architect and build robust, fault‑tolerant data pipelines that ingest content from diverse sources (PDFs, web pages, documents) into a unified data lake.
- Implement ETL processes using Python, Spark, and SQL to clean, enrich, and transform raw data for downstream analytics and model training.
- Integrate streaming data with Kafka to support real‑time content ingestion and processing.
- Collaborate with data scientists and product teams to expose high‑quality datasets for machine learning and feature engineering.
- Optimize pipeline performance, monitor throughput, and troubleshoot production issues across AWS services (S3, Glue, EMR, Redshift).
- Document architecture, data schemas, and best practices to enable cross‑team knowledge sharing.
Requirements
- 5+ years of experience building large‑scale data infrastructure in a cloud environment.
- Proficiency in Python, SQL, and Apache Spark for batch and streaming workloads.
- Hands‑on experience with AWS services (S3, Glue, EMR, Redshift, Kinesis) and Kafka.
- Strong understanding of data modeling, ETL design patterns, and performance tuning.
- Excellent communication skills and a collaborative mindset in a distributed team setting.
Skills
Python, AWS, SQL, Apache Spark, Kafka