onsite
LLM Ops Engineer
Research Engineer
Lead operations for large language model deployments on AWS, keeping services scalable, reliable, and well monitored through A/B testing, autoscaling, and robust alerting.
About the role
Key Responsibilities
- Design, implement, and maintain scalable LLM deployment pipelines on AWS, leveraging autoscaling and load balancing.
- Configure and manage A/B testing frameworks to compare model variants on performance metrics.
- Set up comprehensive alerting and monitoring solutions to detect anomalies and ensure high availability.
- Collaborate with data scientists and ML engineers to integrate new models into production workflows.
- Automate deployment processes using CI/CD tools, ensuring rapid and reliable releases.
Requirements
- Proven experience with AWS services (ECS/EKS, Lambda, CloudWatch, Auto Scaling).
- Strong background in DevOps practices, CI/CD pipelines, and infrastructure as code.
- Hands-on knowledge of A/B testing methodologies and performance monitoring.
- Excellent scripting skills (Python, Bash) and familiarity with containerization.
- Ability to troubleshoot complex production issues and optimize system performance.