Remote
Site Reliability Engineer - Berkeley Research Group, LLC
Join a health‑tech team as a Site Reliability Engineer, building and operating scalable, secure cloud infrastructure using AWS, Kubernetes, and automation tools while ensuring high availability and performance.
About the role
Key Responsibilities
- Design, implement, and maintain highly available services on AWS, leveraging Kubernetes for container orchestration.
- Develop automation scripts and infrastructure‑as‑code using Python and Terraform to streamline deployments and scaling.
- Monitor system health, performance, and reliability with observability tools, responding to incidents and performing root‑cause analysis.
- Collaborate with development and product teams to embed reliability best practices into the software development lifecycle.
- Continuously improve CI/CD pipelines, ensuring secure, repeatable releases and rapid rollback capabilities.
Requirements
- 3+ years of experience managing production Linux environments in a cloud setting.
- Proficiency with AWS services (EC2, RDS, S3, IAM) and container platforms such as Kubernetes.
- Strong scripting skills in Python and experience with Terraform or similar IaC tools.
- Hands‑on experience with monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK).
- Solid understanding of networking, security, and incident response processes.
Skills
Linux, Python, Kubernetes, AWS, Terraform