onsite
AI/GenAI Platform Site Reliability Engineer - Purple Drive Technologies LLC
AI Engineer
About the role
We are looking for a Site Reliability Engineer focused on AI/GenAI platforms. The role is responsible for building and operating cloud‑native infrastructure, automating deployments, and ensuring high availability of large language model services.
Key Responsibilities
- Design, implement, and maintain scalable Kubernetes clusters and supporting services for large language model inference and training workloads.
- Develop infrastructure‑as‑code (IaC) pipelines using Terraform and AWS services to provision, update, and decommission infrastructure reliably.
- Build CI/CD workflows (GitHub Actions, Jenkins, or similar) that automate model packaging, containerization with Docker, and rollout to production.
- Implement an observability stack (Prometheus, Grafana, Loki) to monitor latency, error rates, and resource utilization, and define alerting policies.
- Collaborate with data scientists and application developers to adopt MLOps best practices, ensuring reproducible, version‑controlled model deployments.
Requirements
- 3+ years of SRE or DevOps experience in cloud environments, preferably AWS.
- Strong proficiency in Python for automation, scripting, and tooling.
- Hands‑on experience with Kubernetes, Docker, and infrastructure‑as‑code tools such as Terraform.
- Familiarity with CI/CD pipelines, monitoring solutions (Prometheus/Grafana), and incident response processes.
- Understanding of machine learning workflows, LLM serving patterns, and performance tuning for AI workloads.
Skills
Python, Kubernetes, Terraform, AWS, CI/CD, Prometheus, Grafana, Docker