onsite
AI Reliability Engineer (SRE) for Generative AI Systems - Galaxy i Technologies, Inc.
AI Engineer
Reliability engineer focused on building, scaling, and monitoring generative AI services on cloud infrastructure, using Python, Kubernetes, Terraform, and AWS to ensure high availability and performance of ML pipelines.
About the role
Key Responsibilities
- Design, implement, and operate highly available, scalable infrastructure for generative AI workloads on AWS.
- Develop automation scripts and IaC (Terraform) to provision, configure, and manage Kubernetes clusters supporting model training and inference.
- Implement an observability stack (Prometheus, Grafana, logging) to monitor latency, throughput, and resource utilization of AI services.
- Collaborate with data scientists and application developers to integrate ML pipelines into production, ensuring reproducibility and CI/CD compliance.
- Respond to incidents, perform root‑cause analysis, and drive continuous improvement of reliability and performance.
Requirements
- 3+ years of SRE or DevOps experience with cloud platforms, preferably AWS.
- Strong proficiency in Python for automation and tooling.
- Hands‑on experience with Kubernetes orchestration and Terraform for infrastructure as code.
- Familiarity with MLOps concepts, model serving frameworks, and monitoring of AI workloads.
- Solid understanding of networking, security, and performance tuning in distributed systems.
Skills
Python, Kubernetes, Terraform, AWS, Prometheus