onsite
AI Reliability Engineer (SRE) - GenAI Systems - TekCommands Inc
AI Engineer
Lead the design, deployment, and monitoring of GenAI services, ensuring high availability, scalability, and reliability across cloud and ML infrastructure using Python, AWS, Kubernetes, and CI/CD pipelines.
About the role
Key Responsibilities
- Design, build, and maintain scalable GenAI applications and infrastructure on AWS, leveraging Kubernetes, Docker, and Terraform.
- Implement robust monitoring, alerting, and incident response for AI services, ensuring 99.9% uptime.
- Collaborate with ML teams to integrate model training pipelines into CI/CD workflows, automating deployments with GitOps principles.
- Optimize resource utilization and cost across cloud environments, applying autoscaling and spot instance strategies.
- Develop and enforce SRE best practices, including chaos engineering, capacity planning, and post‑mortem analysis.
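
For context on the 99.9% uptime target above, the following is a minimal sketch of the downtime budget that figure implies. The function name `allowed_downtime_minutes` and the 30-day window are illustrative assumptions, not part of the role description.

```python
def allowed_downtime_minutes(slo: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    `slo` is the availability target as a fraction (0.999 for 99.9%).
    `period_days` is an assumed measurement window; 30 days is a common choice.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - slo)


# Budget for a 99.9% target over an assumed 30-day window.
budget = allowed_downtime_minutes(0.999, period_days=30)
print(f"{budget:.1f} minutes")  # 43.2 minutes
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, which is the budget that monitoring, alerting, and incident response work would need to protect.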
Requirements
- 5+ years of experience in SRE or DevOps roles, with a focus on AI/ML workloads.
- Proficiency in Python, AWS services (EKS, ECS, S3, Lambda), and Kubernetes cluster management.
- Hands‑on experience with Docker, Terraform, and CI/CD tools (GitHub Actions, ArgoCD).
- Strong understanding of MLOps concepts, model versioning, and data pipeline orchestration.
- Excellent problem‑solving skills and the ability to work in a fast‑paced, cross‑functional team.
Skills
Python, AWS, Kubernetes, Docker, Terraform, CI/CD