Remote
Production Support Engineer - SRE - Capgemini
Site Reliability Engineer
This Production Support Engineer role focuses on Site Reliability Engineering. You will be responsible for monitoring, alerting, incident response, and automation across cloud environments running on AWS or GCP and Kubernetes.
About the role
Key Responsibilities
- Design, implement, and maintain monitoring and alerting solutions for production services.
- Lead incident investigations, root‑cause analysis, and post‑mortem documentation.
- Automate operational tasks and improve deployment pipelines using CI/CD tools.
- Collaborate with development and security teams to ensure high availability and reliability.
- Participate in on‑call rotations and provide 24/7 production support.
Requirements
- 3+ years of experience in SRE or production support roles.
- Strong knowledge of cloud platforms (AWS or GCP) and container orchestration (Kubernetes).
- Proficiency with monitoring tools (Prometheus, Grafana, Datadog) and alerting frameworks.
- Hands‑on scripting skills (Python, Bash) and familiarity with CI/CD pipelines.
- Excellent problem‑solving, communication, and teamwork abilities.