remote
Site Reliability Engineer II - C Hit
The Site Reliability Engineer II is responsible for ensuring the reliability, performance, and security of eSMD applications across cloud and on‑premises environments through monitoring, automation, and incident‑response practices.
About the role
Key Responsibilities
- Continuously monitor the health, performance, and availability of applications and underlying infrastructure, using tools such as Prometheus, Grafana, and cloud‑native metrics.
- Design, implement, and maintain observability solutions, including dashboards, alerts, and centralized logging, to provide real‑time insight into the eSMD platform.
- Lead incident response activities, perform root‑cause analysis, and drive post‑mortem reviews to improve system resilience.
- Automate repetitive operational tasks and deployment pipelines using Python, Bash, and Infrastructure‑as‑Code tools like Terraform.
- Collaborate with development and security teams to enforce best practices for reliability, scalability, and compliance in both AWS and on‑premises environments.
Requirements
- 3+ years of experience in site reliability, systems engineering, or DevOps roles.
- Strong proficiency with Linux systems and scripting (Python or Bash).
- Hands‑on experience managing cloud services (AWS) and container orchestration platforms (Kubernetes).
- Practical knowledge of infrastructure automation tools such as Terraform or CloudFormation.
- Familiarity with monitoring and alerting stacks (Prometheus, Grafana, ELK/EFK) and incident‑management processes.
Skills
Linux, AWS, Kubernetes, Terraform, Prometheus, Python