onsite
Site Reliability Engineer - Intapp
Lead the design and operation of AI‑native infrastructure, automating incident response and anomaly detection to keep mission‑critical services running smoothly.
About the role
Key Responsibilities
- Architect, deploy, and maintain highly available cloud infrastructure that supports AI agents and data pipelines.
- Develop and refine automated incident response workflows, leveraging AI for anomaly detection and root‑cause analysis.
- Collaborate with DevOps, security, and product teams to implement observability, monitoring, and alerting best practices.
- Drive continuous improvement of reliability metrics, reducing mean time to recovery (MTTR) and operational toil across the platform.
- Mentor and guide junior engineers on SRE principles, cloud operations, and AI‑powered tooling.
Requirements
- 5+ years of experience in site reliability engineering or cloud operations.
- Proficiency with container orchestration (Kubernetes) and at least one major cloud platform (AWS, GCP, or Azure).
- Hands‑on experience building AI/ML pipelines for monitoring, anomaly detection, or incident automation.
- Strong scripting skills (Python, Bash) and familiarity with CI/CD pipelines.
- Excellent problem‑solving skills and a passion for building resilient, scalable systems.