onsite
Site Reliability Engineer - TEKsystems c/o Allegis Group
About the role
The Site Reliability Engineer is responsible for the stability, performance, and reliability of enterprise applications in production through end‑to‑end monitoring, observability, incident response, and SRE best practices, with a focus on Dynatrace APM.
Key Responsibilities
- Design, implement, and maintain monitoring and observability solutions using Dynatrace to provide real‑time insight into application health and performance.
- Develop and automate incident response processes, including alert routing, on‑call rotations, and post‑mortem analysis.
- Collaborate with engineering and product teams to embed SRE best practices into the software development lifecycle.
- Manage and optimize cloud infrastructure (AWS) and container orchestration platforms (Kubernetes) for high availability and scalability.
- Write and maintain automation scripts (Python, Bash) for deployment, configuration, and remediation tasks.
Requirements
- 3+ years of experience in site reliability, systems engineering, or a related role.
- Strong hands‑on experience with Dynatrace or comparable APM tools.
- Proficiency in Linux administration and scripting languages such as Python or Bash.
- Solid understanding of cloud services (AWS) and container orchestration (Kubernetes).
- Demonstrated ability to lead incident management, conduct root‑cause analysis, and drive continuous improvement.
Skills
Linux, Python, AWS, Kubernetes