remote
Site Reliability Engineer, Inference Infrastructure
Cohere is looking for a Site Reliability Engineer, Inference Infrastructure, to ensure the high availability, reliability, and performance of its inference infrastructure. This role involves optimizing systems, troubleshooting production issues, and developing automation for operational efficiency across multiple global locations.
About the Role
Cohere is seeking a Site Reliability Engineer, Inference Infrastructure. This role focuses on the reliability and performance of our inference infrastructure.
Responsibilities
- Ensure high availability and reliability of Cohere's inference systems.
- Optimize the performance and scalability of inference infrastructure.
- Work across teams to troubleshoot and resolve production issues.
- Implement and maintain monitoring, alerting, and logging solutions.
- Develop automation to streamline operational tasks.
Requirements
- Experience with site reliability engineering or a similar role.
- Strong background in managing large-scale distributed systems.
- Proficiency in cloud platforms (e.g., AWS, GCP, Azure).
- Experience with containerization and orchestration (e.g., Docker, Kubernetes).
- Solid understanding of Linux operating systems.
- Ability to work in a fast-paced, dynamic environment.