Remote
Network Reliability Engineer - Cloudflare
Software Engineer
Drive the reliability of a global edge network by designing and automating monitoring, incident response, and capacity planning for high‑availability services, using Python, Kubernetes, and Cloudflare’s proprietary tooling.
About the role
Key Responsibilities
- Design, implement, and maintain end‑to‑end monitoring and alerting for a multi‑region edge network.
- Lead incident investigations, root‑cause analysis, and post‑mortem documentation to improve system resilience.
- Collaborate with software, security, and infrastructure teams to automate reliability workflows and capacity planning.
- Develop and maintain tooling in Python and Go to surface network health metrics and drive proactive remediation.
- Participate in on‑call rotations, ensuring rapid response to outages and performance regressions.
Requirements
- 5+ years of experience in network operations or reliability engineering at large scale.
- Strong knowledge of TCP/IP, BGP, DNS, and CDN architectures.
- Proficiency in Python and experience with Kubernetes or similar container orchestration.
- Hands‑on experience with monitoring systems (Prometheus, Grafana, or equivalent) and incident management tools.
- Excellent communication skills and a collaborative mindset.