remote

Site Reliability Engineer - Berkeley Research Group, LLC

Join a health‑tech team as a Site Reliability Engineer, building and operating scalable, secure cloud infrastructure using AWS, Kubernetes, and automation tools while ensuring high availability and performance.

About the role

Key Responsibilities

Design, implement, and maintain highly available services on AWS, leveraging Kubernetes for container orchestration.
Develop automation scripts and infrastructure‑as‑code using Python and Terraform to streamline deployments and scaling.
Monitor system health, performance, and reliability with observability tools, responding to incidents and performing root‑cause analysis.
Collaborate with development and product teams to embed reliability best practices into the software development lifecycle.
Continuously improve CI/CD pipelines, ensuring secure, repeatable releases and rapid rollback capabilities.

Requirements

3+ years of experience managing production Linux environments in a cloud setting.
Proficiency with AWS services (EC2, RDS, S3, IAM) and container platforms such as Kubernetes.
Strong scripting skills in Python and experience with Terraform or similar IaC tools.
Hands‑on experience with monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK).
Solid understanding of networking, security, and incident response processes.

Skills

Linux, Python, Kubernetes, AWS, Terraform

Company: Berkeley Research Group, LLC

Department: Engineering

Location: United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Salary: 160,000

Posted June 26, 2026