remote

Customer Site Reliability Engineer - OpenShift Managed Cloud Services Kubernetes - Red Hat

Site Reliability Engineer

Customer Site Reliability Engineer focused on OpenShift Managed Cloud Services, ensuring high availability and performance of critical services at scale using Kubernetes, automation, monitoring, and incident management techniques.

About the role

Red Hat is looking for a Customer Site Reliability Engineer (CSRE) to join our OpenShift Managed Cloud Services (MCS) team. The CSRE plays a crucial role in ensuring the availability, reliability, and performance of critical services at scale. This role is responsible for independently managing complex systems and solving intricate problems that have a significant impact on service quality and stability.

A CSRE has a customer-first mindset and will act as a technical lead for customer escalations, applying expert troubleshooting to ensure timely and effective resolutions that maintain trust and confidence. They will leverage extensive experience in software and systems engineering to automate operations, reduce toil, and drive continuous improvement across the service lifecycle. They work autonomously, demonstrating strong judgment and decision-making capabilities while managing non-routine assignments.

Collaboration is essential, as you will partner with Technical Account Managers, Services, Fleet SRE, DevOps, and infrastructure teams to address customer-specific and fleet-wide issues, ensuring the stability and functionality of our cloud-based systems.

As a champion of Knowledge-Centered Support (KCS), you will document resolutions, root causes, and best practices to enrich the knowledge base and promote self-service solutions. Additionally, you will mentor team members, fostering a collaborative culture of continuous learning that equips them to manage complex challenges.

This role is ideal for a highly skilled and motivated individual who thrives in a fast-paced, collaborative environment and is passionate about driving reliability, scalability, and customer satisfaction.

What you will do

Manage large-scale, distributed systems, focusing on minimizing downtime and improving system resilience.

Maintain customer trust and confidence by ensuring stability and functionality of services.

Drive continuous enhancement of processes, tools, and methodologies to support the evolving needs of the service.

Lead the development of code and automation scripts to optimize the scalability, reliability, and performance of services.

Lead and participate in high-priority customer escalations, adopting a customer-first mindset.

Coordinate and execute complex incident response procedures, ensuring timely resolution and thorough postmortems.

Collaborate with cross-functional teams to enhance system robustness.

Demonstrate a proactive mindset to help preempt escalations and ensure reliable operations.

Document resolutions, root causes, and best practices to enrich the knowledge base and promote self-service solutions.

Mentor and coach team members, fostering a culture of continuous learning, knowledge sharing, and collaboration.

Participate in the on-call rotation and provide leadership during critical incidents.

Collaborate on strategic AI and a