Onsite
Lead Site Reliability Engineer - TekChronicles
Site Reliability Engineer
Lead SRE responsible for defining reliability goals, building automation, and improving observability for mission‑critical risk technology applications using Kubernetes, Terraform, and cloud services.
About the role
Key Responsibilities
- Define and drive SRE objectives, including SLAs, SLOs, SLIs, and error‑budget management for business‑critical services.
- Design, implement, and maintain highly available infrastructure on AWS using Kubernetes and Terraform, following infrastructure-as-code (IaC) best practices.
- Develop and enhance monitoring, alerting, and observability pipelines with Prometheus, Grafana, and custom instrumentation.
- Automate deployment, scaling, and incident response workflows through CI/CD pipelines and scripting (Python/Go).
- Collaborate with development, support, and security teams to embed reliability and resiliency into the software development lifecycle.
Requirements
- 5+ years of hands‑on SRE or DevOps experience in large‑scale, cloud‑native environments.
- Deep expertise with Kubernetes orchestration, Terraform, and AWS services.
- Proven track record building observability stacks (Prometheus, Grafana) and managing error budgets.
- Strong programming/scripting skills in Python (or Go) for automation and tooling.
- Experience establishing and enforcing reliability standards, incident management processes, and continuous improvement practices.
Skills
Kubernetes, Terraform, Prometheus, Grafana, Python, AWS, CI/CD