Remote
Engineering Manager, Reliability Platform - Affirm
Lead a Reliability Platform team to design and operate foundational observability and risk‑reduction tools, driving scalable reliability practices across production systems using Python, AWS, and modern SRE methodologies.
About the role
Key Responsibilities
- Lead and mentor a cross‑functional reliability engineering team, setting vision, goals, and career development paths.
- Design and ship platform services that provide operational intelligence, automated risk scoring, and incident response tooling for production environments.
- Collaborate with product, infrastructure, and security teams to define reliability standards, SLAs, and error‑budget policies.
- Drive adoption of observability best practices, including metrics, tracing, logging, and alerting pipelines on AWS.
- Own the incident lifecycle: detection, response, post‑mortem analysis, and continuous improvement of reliability processes.
Requirements
- 5+ years of software engineering experience with a strong focus on site reliability, performance, or infrastructure engineering.
- Proven track record leading technical teams and delivering large‑scale reliability platforms.
- Hands‑on expertise in Python (or a comparable language) and cloud services, preferably AWS.
- Deep understanding of observability stacks, incident management, and reliability metrics.
- Excellent communication skills and the ability to influence stakeholders across the organization.