On-site
Senior Expert Site Reliability Engineer - Vodafone GmbH
Site Reliability Engineer
Lead the design, implementation, and maintenance of highly available, scalable infrastructure using Kubernetes, Docker, and AWS. Drive automation, monitoring, and incident response to ensure optimal system performance and reliability.
About the role
Key Responsibilities
- Architect and maintain production-grade Kubernetes clusters, ensuring high availability and scalability across multiple regions.
- Design and implement CI/CD pipelines with GitOps principles, automating deployments and rollbacks.
- Develop and maintain monitoring dashboards using Prometheus and Grafana, and set up alerting for critical incidents.
- Collaborate with development teams to embed reliability best practices into application design.
- Lead incident investigations, root cause analysis, and post‑mortem documentation to continuously improve system resilience.
Requirements
- 5+ years of experience in Site Reliability Engineering or DevOps roles.
- Deep expertise with Kubernetes, Docker, and cloud platforms (AWS preferred).
- Strong scripting skills in Python and experience with Infrastructure as Code (IaC) tools such as Terraform or CloudFormation.
- Proven track record of building automated monitoring, alerting, and incident response workflows.
- Excellent communication skills and ability to mentor junior engineers.
Skills
Kubernetes, Docker, AWS, Prometheus, Grafana, Python