remote
Distinguished Software Engineer - AI/ML Engineer Agentic Systems & Site Reliability - Walmart
ML Engineer
Lead the design and deployment of next‑generation agentic AI systems and autonomous automation solutions, ensuring mission‑critical reliability and scalability across a global technology ecosystem.
About the role
Key Responsibilities
- Architect and implement scalable machine learning platforms that power autonomous agents for monitoring, prediction, and automated issue resolution.
- Drive end‑to‑end development of agentic AI solutions, from data ingestion and model training to deployment and continuous improvement.
- Collaborate with cross‑functional teams to integrate AI capabilities into site reliability workflows, enhancing operational excellence and reducing mean time to recovery.
- Leverage AWS services (SageMaker, Lambda, Step Functions) to build resilient, high‑throughput pipelines and real‑time inference endpoints.
- Establish best practices for model governance, monitoring, and performance tuning in a production environment.
Requirements
- Extensive experience (10+ years) in software engineering with a focus on AI/ML and site reliability.
- Proficiency in Python, distributed systems, and cloud‑native architecture on AWS.
- Deep knowledge of autonomous agent design, reinforcement learning, and production‑grade ML pipelines.
- Strong background in monitoring, observability, and incident response at scale.
- Excellent communication skills and a proven track record of leading high‑impact technical initiatives.
Skills
python, machine learning, aws