onsite
Principal Engineer - AI Ops - Wells Fargo
Systems Engineer
About the role
Lead the design and delivery of enterprise‑scale AI Ops solutions, driving zero‑touch production through advanced observability, automation, and event‑driven architectures to enhance resilience and reduce manual toil.
Key Responsibilities
- Architect and implement next‑generation AI Ops platforms that enable autonomous monitoring, incident response, and remediation across the enterprise.
- Define and execute a zero‑touch production strategy, integrating AI/ML models, observability tooling, and automation pipelines.
- Collaborate with cross‑functional teams to translate business requirements into scalable, resilient event‑driven architectures.
- Lead the design of observability frameworks, including metrics, logs, traces, and alerting, ensuring end‑to‑end visibility.
- Mentor and guide engineering teams on best practices for AI Ops, platform reliability, and continuous improvement.
Requirements
- 10+ years of experience in platform engineering, with a focus on AI Ops, observability, and automation.
- Deep expertise in AI/ML model deployment, monitoring, and operationalization at scale.
- Proven track record designing event‑driven, microservices‑based architectures that support zero‑touch operations.
- Strong leadership skills, able to influence stakeholders and drive cross‑team collaboration.
- Excellent communication and problem‑solving abilities, with a passion for building resilient, self‑healing systems.
Skills
Generative AI, LLM, LangChain, Python, Kubernetes, Terraform, Ansible, Prometheus