onsite
Principal Engineer - Site Reliability Engineering and AIOps - Wells Fargo
Software Engineer
About the role
Lead enterprise‑wide Site Reliability Engineering and AIOps initiatives, defining reliability strategy, reference architectures, and automation standards to embed resilience across a large application portfolio.
Key Responsibilities
- Architect and implement enterprise‑scale SRE and AIOps frameworks, including SLOs, error budgets, and incident response playbooks.
- Drive full‑stack observability across multiple lines of business, selecting and integrating monitoring, tracing, and log analytics tools.
- Lead cross‑functional teams to embed reliability best practices into the software delivery lifecycle and operating model.
- Develop and maintain reference architectures, engineering standards, and automation pipelines to accelerate reliability improvements.
- Mentor and coach engineering teams on SRE principles, incident management, and continuous improvement.
Requirements
- 10+ years of experience in large‑scale distributed systems, with deep expertise in SRE and AIOps.
- Proven track record designing and deploying observability, incident response, and automation solutions at enterprise scale.
- Strong knowledge of cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes).
- Excellent communication skills and ability to influence stakeholders across multiple business units.
- Experience with infrastructure as code (IaC), CI/CD, and modern monitoring/alerting stacks (Prometheus, Grafana, ELK, etc.).
Skills
Python, Java, Ansible, Linux, Prometheus, Grafana, Splunk, Agile