remote
Site Reliability Engineer
We are seeking a Senior Site Reliability Engineer to define SLOs, manage error budgets, and implement chaos engineering practices that keep production systems resilient.
About the role
Key Responsibilities
- Define and operationalize Service Level Objectives (SLOs) and error budgets in production environments
- Design and implement reliability-focused monitoring, alerting, and incident response systems
- Conduct chaos engineering experiments to identify and mitigate system vulnerabilities
- Collaborate with development teams to improve system resilience and observability
- Optimize infrastructure performance and cost-efficiency through automation and best practices
- Participate in on-call rotations and post-mortem analysis to drive continuous improvement
Requirements
- 3+ years of experience in site reliability engineering or production operations
- Hands-on experience with SLOs, error budgets, and reliability metrics
- Familiarity with chaos engineering tools and methodologies
- Strong scripting and automation skills (Python, Bash, etc.)
- Experience with cloud platforms (AWS, GCP, Azure) and container orchestration
Skills
SLOs, error budgets, chaos engineering, production operations, system reliability