Remote, Onsite
SENIOR SITE RELIABILITY ENGINEER - Svitla Systems
Site Reliability Engineer
Senior Site Reliability Engineer responsible for automating incident routing, managing uptime metrics, and ensuring 24/7 reliability of a large online marketplace using Kubernetes, Prometheus, Grafana, and AWS.
About the role
Key Responsibilities
- Design and implement automated incident routing to ensure rapid response across multiple teams.
- Own and improve key reliability metrics such as MTTD, MTTR, and uptime for a 24/7 marketplace.
- Develop and maintain the observability stack using Prometheus, Grafana, and custom dashboards.
- Collaborate with development and operations teams to embed reliability best practices into CI/CD pipelines.
- Lead post‑mortem analysis and drive continuous improvement initiatives.
Requirements
- 5+ years of experience in site reliability or DevOps roles.
- Proficiency with Kubernetes, container orchestration, and cloud platforms (AWS preferred).
- Strong scripting skills (Python, Bash) and experience with monitoring/alerting tools.
- Excellent incident management and communication skills.
- Experience with CI/CD tooling and automated deployment pipelines.
Skills
Kubernetes, Prometheus, Grafana, AWS