Onsite
Lead Site Reliability Engineer - Relativity
Site Reliability Engineer
Lead the reliability and performance of RelativityOne, driving SRE best practices across services, monitoring, incident response, and automation using Kubernetes, Docker, AWS, and observability tools.
About the role
Key Responsibilities
- Own end‑to‑end reliability of core platform services, ensuring high availability and fault tolerance.
- Design, implement, and maintain scalable monitoring, alerting, and incident response workflows with Prometheus, Grafana, and PagerDuty.
- Lead automation of deployment pipelines and configuration management using CI/CD tools and infrastructure as code.
- Collaborate with development teams to embed SRE principles into feature design and code reviews.
- Drive capacity planning, performance tuning, and cost optimization across AWS infrastructure.
Requirements
- 5+ years of SRE or DevOps experience with large-scale distributed systems.
- Proficiency with Kubernetes, Docker, and cloud platforms (AWS preferred).
- Strong scripting skills (Python, Bash) and experience with CI/CD pipelines.
- Excellent skills in incident management, root cause analysis, and post‑mortems.
- Effective communication and mentorship abilities for cross‑functional teams.
Skills
Kubernetes, Docker, AWS, Prometheus, Grafana, CI/CD