onsite

AI Reliability Engineer SRE - GenAI Systems - TekCommands Inc

AI Engineer

Lead the design, deployment, and monitoring of GenAI services, ensuring high availability, scalability, and reliability across cloud and ML infrastructure using Python, AWS, Kubernetes, and CI/CD pipelines.

About the role

Key Responsibilities

Design, build, and maintain scalable GenAI applications and infrastructure on AWS, leveraging Kubernetes, Docker, and Terraform.
Implement robust monitoring, alerting, and incident response for AI services, ensuring 99.9% uptime.
Collaborate with ML teams to integrate model training pipelines into CI/CD workflows, automating deployments with GitOps principles.
Optimize resource utilization and cost across cloud environments, applying autoscaling and spot instance strategies.
Develop and enforce SRE best practices, including chaos engineering, capacity planning, and post‑mortem analysis.

Requirements

5+ years of experience in SRE or DevOps roles, with a focus on AI/ML workloads.
Proficiency in Python, AWS services (EKS, ECS, S3, Lambda), and Kubernetes cluster management.
Hands‑on experience with Docker, Terraform, and CI/CD tools (GitHub Actions, ArgoCD).
Strong understanding of MLOps concepts, model versioning, and data pipeline orchestration.
Excellent problem‑solving skills and the ability to work in a fast‑paced, cross‑functional team.

Skills

Python, AWS, Kubernetes, Docker, Terraform, CI/CD

Company: TekCommands Inc

Department: Engineering

Location: Michigan, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 26, 2026