About the Role
Epic Games is looking for a Senior Staff Site Reliability Engineer (SRE) to join our team. In this role, you will be instrumental in ensuring the reliability, scalability, and performance of our critical systems and services. You will work closely with development teams, providing expertise in system architecture, automation, and operational best practices.
What You'll Do
- Design, implement, and maintain highly available and scalable infrastructure in various cloud environments (AWS, Azure, GCP).
- Develop and manage automation tools and frameworks using technologies like Kubernetes, Terraform, and Ansible.
- Write clean, efficient, and well-documented code in languages such as Python, Go, C#, or Java, as well as Bash/shell scripts.
- Proactively identify and address performance bottlenecks, reliability issues, and security vulnerabilities.
- Implement and manage robust monitoring, alerting, and observability solutions to ensure system health.
- Lead incident response efforts, conduct thorough postmortems, and implement preventive measures.
- Collaborate with cross-functional teams to define and implement service level objectives (SLOs) and service level indicators (SLIs).
- Mentor junior SREs and contribute to a culture of continuous learning and improvement.
What We're Looking For
- Significant experience as a Site Reliability Engineer or in a similar role focusing on large-scale distributed systems.
- Strong expertise in cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes).
- Proficiency in infrastructure as code (Terraform) and configuration management (Ansible).
- Excellent programming skills in at least one of Python, Go, C#, or Java, along with strong Bash/shell scripting abilities.
- Deep understanding of Linux operating systems, networking (load balancing, DNS, TCP/IP), and distributed systems concepts.
- Experience with various data stores (relational, NoSQL, caching).
- Proven track record of designing and implementing effective monitoring, alerting, and observability solutions.
- Demonstrated ability to lead incident response, perform root cause analysis, and implement long-term solutions.
- Experience with disaster recovery planning and implementation.
- Strong communication and collaboration skills.