Remote
Partner 20, Staff Engineer, Incident Response - a16z crypto
Site Reliability Engineer
Seeking a Lead Site Reliability Engineer to design resilient systems and lead incident response initiatives.
About the role
Key Responsibilities
- Lead incident response and post-mortem analysis for critical systems
- Design and implement scalable reliability solutions
- Automate infrastructure and deployment pipelines
- Monitor system performance and proactively address issues
- Collaborate with engineering teams to improve system resilience
- Document incident response protocols and best practices
Requirements
- 5+ years of experience in site reliability or incident response
- Expertise in Kubernetes, Docker, and cloud infrastructure
- Strong scripting and automation skills
- Experience with Terraform and infrastructure-as-code
- Excellent troubleshooting and communication skills
Skills
incident response, SRE, Kubernetes, Docker, Terraform, cloud infrastructure