onsite
Site Reliability Engineer, AI & Agentic Systems - ServiceLink
About the role
This Site Reliability Engineer role focuses on AI‑driven automation and performance engineering. You will keep platforms scalable and resilient, using Azure AI services and agentic systems to reduce toil and improve incident response.
Key Responsibilities
- Own end‑to‑end reliability of production systems, diagnosing and resolving incidents with minimal downtime.
- Design and implement AI‑powered automation using Azure‑native AI services to streamline operations and reduce manual toil.
- Develop and run performance and load tests to validate system resilience under peak conditions.
- Collaborate with development and security teams to embed reliability best practices into CI/CD pipelines.
- Write and maintain runbooks, post‑mortems, and knowledge‑base articles to improve shared knowledge and incident handling.
Requirements
- 3+ years of SRE or DevOps experience in cloud environments, preferably Azure.
- Hands‑on experience with Azure AI services (e.g., Azure Cognitive Services, Azure Machine Learning).
- Strong scripting skills (Python, PowerShell) and familiarity with IaC tools (Terraform, ARM templates).
- Proficiency in performance testing tools (JMeter, k6, Locust) and monitoring platforms (Azure Monitor, Grafana).
- Excellent problem‑solving skills, with a track record of automating repetitive tasks and improving incident response.