Remote
Senior Staff Reliability Engineer - Merck
Software Engineer
Lead the design and evolution of reliability practices for global digital platforms, driving observability, automation, and resilient architecture using Kubernetes, Terraform, and Python.
About the role
Key Responsibilities
- Define and mature reliability engineering standards, SLAs, and error budgets for mission‑critical applications.
- Design, implement, and operate observability pipelines (metrics, logs, traces) to provide real‑time insight into system health.
- Automate deployment, scaling, and self‑healing of services using Kubernetes, Terraform, and CI/CD tooling.
- Lead incident response, root‑cause analysis, and post‑mortem processes to drive continuous improvement.
- Collaborate with development, security, and product teams to embed resilience patterns into the software development lifecycle.
Requirements
- 10+ years of experience in Site Reliability Engineering or related roles, with a proven track record of building highly available systems.
- Deep expertise in cloud‑native technologies such as Kubernetes, containers, and infrastructure‑as‑code (Terraform, CloudFormation).
- Strong programming/scripting skills in Python (or Go) for automation and tooling development.
- Extensive experience with observability stacks (Prometheus, Grafana, OpenTelemetry, ELK) and incident management frameworks.
- Excellent communication and leadership abilities to influence cross‑functional teams and mentor junior engineers.
Skills
Kubernetes, Terraform, Python