Remote
Senior Software Engineer / SRE - itD Tech
Site Reliability Engineer
Lead the design, development and operation of large‑scale observability systems, ensuring high availability and performance of services using Python, Go, Kubernetes, Prometheus, Grafana, AWS and Terraform.
About the role
Key Responsibilities
- Architect and implement scalable observability solutions across distributed services.
- Lead incident response, root cause analysis, and post‑mortem processes to improve reliability.
- Collaborate with development teams to embed monitoring, logging and alerting into CI/CD pipelines.
- Automate infrastructure and observability tooling using Terraform, Kubernetes and cloud-native services.
- Mentor junior engineers and drive best practices for SRE culture and tooling.
Requirements
- 5+ years of experience in software engineering with a focus on reliability and observability.
Skills
Python, Go, Kubernetes, Prometheus, Grafana, AWS, Terraform