Remote / Onsite
Python PySpark AWS - CGI
Software Engineer
Lead data engineering projects using Python and PySpark on AWS, designing scalable pipelines, optimizing performance, and ensuring data quality for enterprise clients.
About the role
Key Responsibilities
- Design, develop, and maintain large-scale data pipelines using Python and PySpark on AWS services such as EMR, S3, and Redshift.
- Collaborate with data scientists and business stakeholders to translate analytical requirements into robust ETL solutions.
- Optimize Spark jobs for performance and cost, implementing best practices for partitioning, caching, and resource allocation.
- Implement data quality checks, monitoring, and alerting to ensure reliability and compliance with data governance standards.
- Document architecture, code, and processes, and mentor junior engineers on Spark and AWS best practices.
Requirements
- 5+ years of experience in data engineering with strong proficiency in Python and PySpark.
- Hands‑on experience deploying and managing Spark workloads on AWS (EMR, Glue), along with supporting services such as Lambda.
- Solid understanding of SQL, relational and NoSQL databases, and data modeling.
- Experience with CI/CD pipelines, version control (Git), and automated testing.
- Excellent problem‑solving skills and ability to work in a fast‑paced, collaborative environment.