remote

Senior Staff Reliability Engineer Director - MSD

Software Engineer

Leads reliability engineering for digital platforms, defining and maturing SRE practices, observability, and resilience. Drives automation, incident response, and monitoring across cloud infrastructure to ensure scalable, resilient services for critical scientific and business outcomes.

About the role

Job Description

Join our company as we transform and innovate. We are at the forefront of delivering reliable, scalable, and resilient digital solutions that support critical scientific and business outcomes across our global organization. Our Digital Platforms & Services organization provides the technical foundation powering our company’s applications.

We are seeking a highly experienced engineer who brings deep expertise in Site Reliability Engineering (SRE), Observability, and Resilience to help define and mature our reliability engineering practices. As a Senior Principal Reliability Engineer, you will lead the evolution of how reliability is engineered, measured, and improved across IT systems. You will play a critical role in enabling engineering teams to build systems that are reliable by design, while shaping enterprise practices that scale across the organization.

This is a highly visible and impactful role with the potential to significantly improve the reliability, resilience, and operational effectiveness of the IT products that power our company’s mission.

Responsibilities

Build relationships across the broader IT organization to increase adoption and maturity of SRE, Observability, and Resilience practices

Define and evolve the strategic vision for enterprise reliability engineering and ensure alignment across product, platform, and ITSM teams

Establish and enforce standards for Service Level Objectives, observability frameworks, and resilience engineering practices

Collaborate with engineering teams to ensure reliability is embedded into architecture, design, and delivery processes

Drive adoption of Service Level Objectives using Nobl9 as the system of record for reliability governance

Lead evaluation and introduction of new technologies that improve reliability outcomes while integrating with existing platforms

Apply AI capabilities to enhance reliability practices, including incident triage, diagnostics, and automation, in a governed and controlled manner

Collaborate on efforts to standardize observability across logs, metrics, traces, and events to improve system visibility and decision-making

Consult on and promote resilience patterns, including fault isolation, failover strategies, and recovery mechanisms

Guide improvements to incident lifecycle effectiveness, including detection, response, root cause analysis, and continuous improvement

Lead and mentor a community of reliability practitioners to grow organizational capability and maturity

Represent reliability engineering practice in architecture reviews, governance forums, and key IT initiatives

Drive continuous improvement of reliability practices through research, innovation, and feedback from engineering teams

Required Minimum Qualifications:

Bachelor's degree in IT, Engineering, Computer Science, or related f