Software Engineer
Leads reliability engineering for digital platforms, defining and maturing SRE practices, observability, and resilience. Drives automation, incident response, and monitoring across cloud infrastructure to ensure scalable, resilient services for critical scientific and business outcomes.
Job Description
Join our company as we transform and innovate. We are at the forefront of delivering reliable, scalable, and resilient digital solutions that support critical scientific and business outcomes across our global organization. Our Digital Platforms & Services organization provides the technical foundation powering our company’s applications. We are seeking a highly experienced engineer who brings deep expertise in Site Reliability Engineering (SRE), Observability, and Resilience to help define and mature our reliability engineering practices.
As a Senior Principal Reliability Engineer, you will lead the evolution of how reliability is engineered, measured, and improved across IT systems. You will play a critical role in enabling engineering teams to build systems that are reliable by design, while shaping enterprise practices that scale across the organization. This is a highly visible and impactful role with the potential to significantly improve the reliability, resilience, and operational effectiveness of the IT products that power our company’s mission.
Responsibilities
Build relationships across the broader IT organization to increase adoption and maturity of SRE, Observability, and Resilience practices
Define and evolve the strategic vision for enterprise reliability engineering and ensure alignment across product, platform, and ITSM teams
Establish and enforce standards for Service Level Objectives, observability frameworks, and resilience engineering practices
Collaborate with engineering teams to ensure reliability is embedded into architecture, design, and delivery processes
Drive adoption of Service Level Objectives using Nobl9 as the system of record for reliability governance
Lead evaluation and introduction of new technologies that improve reliability outcomes while integrating with existing platforms
Apply AI capabilities to enhance reliability practices, including incident triage, diagnostics, and automation, in a governed and controlled manner
Contribute to efforts to standardize observability across logs, metrics, traces, and events to improve system visibility and decision-making
Consult on and promote resilience patterns, including fault isolation, failover strategies, and recovery mechanisms
Guide improvements to incident lifecycle effectiveness, including detection, response, root cause analysis, and continuous improvement
Lead and mentor a community of reliability practitioners to grow organizational capability and maturity
Represent the reliability engineering practice in architecture reviews, governance forums, and key IT initiatives
Drive continuous improvement of reliability practices through research, innovation, and feedback from engineering teams
Required Minimum Qualifications
Bachelor's degree in IT, Engineering, Computer Science, or related field
Posted June 25, 2026