Remote
Infrastructure Engineer - Developer Platform
DevOps Engineer
Lead the design and operation of a scalable developer platform, building robust cloud infrastructure, CI/CD pipelines, and Kubernetes clusters to support AI model training and deployment.
About the role
Key Responsibilities
- Architect, deploy, and maintain highly available, secure cloud infrastructure on AWS to support large‑scale AI workloads.
- Design and implement CI/CD pipelines using GitHub Actions, Terraform, and Docker to automate model training, testing, and deployment.
- Manage Kubernetes clusters, ensuring efficient resource utilization, autoscaling, and rolling updates for production services.
- Collaborate with data scientists and ML engineers to optimize infrastructure for low‑latency inference and high‑throughput training.
- Implement monitoring, logging, and alerting solutions (Prometheus, Grafana, CloudWatch) to guarantee platform reliability and performance.
Requirements
- 5+ years of experience building and operating cloud infrastructure for AI/ML workloads.
- Proficiency with AWS services (EC2, EKS, S3, RDS, CloudWatch) and Kubernetes administration.
- Strong scripting skills in Python and experience with Terraform or similar IaC tools.
- Hands‑on experience with CI/CD tooling (GitHub Actions, Jenkins, ArgoCD) and containerization (Docker, OCI).
- Excellent problem‑solving skills and the ability to work in a fast‑moving, cross‑functional team.
Skills
AWS, Kubernetes, Terraform, Python, CI/CD