onsite
Site Reliability Engineer - TEKsystems
About the role
The Site Reliability Engineer is responsible for ensuring the production stability and performance of enterprise applications through end‑to‑end monitoring, Dynatrace APM, incident response, and automation on Linux and cloud platforms.
Key Responsibilities
- Design, implement, and maintain end‑to‑end monitoring and observability solutions, with a focus on Dynatrace APM.
- Own the incident management lifecycle: detection, triage, root‑cause analysis, and post‑mortem documentation.
- Collaborate with engineering and product teams to define service level objectives (SLOs) and reliability targets.
- Automate operational tasks using Python or Bash scripts to improve reliability and reduce manual toil.
- Manage and optimize Linux‑based production environments, including cloud resources on AWS.
Requirements
- 3+ years of experience in site reliability, production support, or DevOps roles.
- Strong hands‑on experience with Dynatrace or similar APM tools.
- Proficiency in Linux system administration and scripting (Python, Bash).
- Solid understanding of monitoring, observability, and incident response best practices.
- Experience working with cloud platforms, preferably AWS, and infrastructure‑as‑code concepts.