remote
Application SRE AI Solutions Engineer - PepsiCo
Software Engineer
Self‑driven engineer who designs resilient architectures for ML models, LLMs, and AI agents, applying SRE and QA principles to production pipelines using cloud, container, and IaC technologies.
About the role
Key Responsibilities
- Design and implement robust architectural patterns for Machine Learning models, Large Language Models, AI agents, and computer‑vision solutions that embed Site Reliability Engineering (SRE) and quality‑assurance practices.
- Collaborate with Data and AI Architecture teams to align solutions with corporate AI strategy, technical standards, and governance.
- Develop and maintain CI/CD pipelines, infrastructure‑as‑code, and observability stacks to enable shift‑left testing, automated validation, and rapid, safe deployments.
- Implement proactive monitoring, alerting, and incident‑response frameworks to minimize production impact and ensure high availability.
- Drive automation of scaling, fault tolerance, and security controls across cloud (AWS) and container (Kubernetes/Docker) environments.
Requirements
- Strong software engineering background with 3+ years of experience in Python and cloud‑native development.
- Hands‑on expertise with Kubernetes, Docker, and IaC tools such as Terraform.
- Deep understanding of the ML/LLM lifecycle, model serving, and AI‑specific reliability challenges.
- Experience building CI/CD pipelines, monitoring, and alerting systems for production AI workloads.
- Proven ability to work cross‑functionally, translate architectural patterns into production code, and drive continuous improvement.
Skills
Python, Kubernetes, Docker, AWS, Terraform, Machine Learning