Onsite

Lead Site Reliability Engineer - TekChronicles

Site Reliability Engineer

Lead SRE responsible for defining reliability goals, building automation, and improving observability for mission‑critical risk technology applications using Kubernetes, Terraform, and cloud services.

About the role

Key Responsibilities

Define and drive SRE objectives, including SLAs, SLOs, SLIs, and error‑budget management for business‑critical services.
Design, implement, and maintain highly available infrastructure on AWS using Kubernetes, Terraform, and IaC best practices.
Develop and enhance monitoring, alerting, and observability pipelines with Prometheus, Grafana, and custom instrumentation.
Automate deployment, scaling, and incident response workflows through CI/CD pipelines and scripting (Python/Go).
Collaborate with development, support, and security teams to embed reliability and resiliency into the software development lifecycle.

Requirements

5+ years of hands‑on SRE or DevOps experience in large‑scale, cloud‑native environments.
Deep expertise with Kubernetes orchestration, Terraform, and AWS services.
Proven track record building observability stacks (Prometheus, Grafana) and managing error budgets.
Strong programming/scripting skills in Python (or Go) for automation and tooling.
Experience establishing and enforcing reliability standards, incident management processes, and continuous improvement practices.

Skills

Kubernetes, Terraform, Prometheus, Grafana, Python, AWS, CI/CD

Company: TekChronicles

Department: Engineering

Location: Jersey City, New Jersey, United States

Experience: 9+ years

Tenure: Full-time

Level: Lead

Posted June 24, 2026