Platform Engineer, AI/ML Infrastructure
Staff Platform Engineer, AI/ML Infrastructure
Department: AI Software & Operations
Role Summary

The Staff Platform Engineer, AI/ML Infrastructure will provide technical leadership for the cloud platforms, deployment systems, and operational foundations that power enterprise-scale generative AI applications.

This role will define and evolve the infrastructure architecture for AI/ML platforms running across AWS, Kubernetes, serverless, and containerized environments. The engineer will lead platform standards for reliability, scalability, observability, CI/CD, security, and developer enablement, while partnering closely with software engineering, AI engineering, security, and operations teams.

The ideal candidate combines deep hands-on cloud engineering experience with staff-level technical influence. They are comfortable designing infrastructure patterns, writing infrastructure-as-code, improving delivery pipelines, mentoring engineers, and making architectural decisions that raise the operational maturity of AI platforms across multiple teams.

Key Responsibilities

- Define and drive the technical strategy for AI/ML platform infrastructure supporting generative AI applications, LLM integrations, model routing, and enterprise AI services.
- Architect, build, and operate scalable cloud platforms using AWS services such as EKS, ECS Fargate, Lambda, DynamoDB, S3, OpenSearch, Secrets Manager, CloudWatch, ALB, and MWAA.
- Establish reusable infrastructure patterns using CloudFormation, Helm, and Terraform to support reliable multi-environment and multi-region deployments.
- Lead CI/CD architecture using GitHub Actions, reusable workflows, OIDC-based AWS authentication, automated quality gates, deployment promotion, and environment approvals.
- Design and improve observability across AI platforms, including CloudWatch dashboards, logs, alarms, Prometheus/Grafana, OpenSearch, Langfuse, and LLM-specific operational metrics.
- Build platform capabilities for GenAI workloads, including model availability monitoring.
- Partner with software engineering teams to improve deployment reliability, rollback strategies, health checks, autoscaling, load testing, and runtime performance.
- Define and enforce security and compliance practices for infrastructure, including IAM permission boundaries, Secrets Manager usage, secret scanning, audit logging, tagging standards, and change-management controls.
- Provide technical leadership for cost optimization, capacity planning, environment standardization, and operational resilience across development, test, production, and sandbox environments.
- Mentor engineers, review architecture and infrastructure designs, and influence platform engineering practices across teams.
Basic Qualifications

- Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field, or equivalent practical experience.
- 7+ years of experience in DevOps, platform engineering, cloud infrastructure, site reliability en
Posted June 10, 2026