Remote
SRE & Production Management - Amaris Consulting
Site Reliability Engineer
The SRE & Production Engineer ensures system reliability, builds observability tooling, automates deployments, and manages incident response across a global tech stack running on AWS and GCP with modern monitoring solutions.
About the role
Key Responsibilities
- Design, build, and maintain production systems across hardware, software, application, and network layers.
- Own production support, including rotational follow‑the‑sun coverage and incident response.
- Develop and enhance observability tools, dashboards, and alerting to improve system visibility.
- Automate deployment pipelines, configuration management, and operational workflows using CI/CD and IaC.
- Collaborate with engineering and development teams to implement best practices for reliability and scalability.
Requirements
- Proven experience as an SRE or Production Engineer in a large, distributed environment.
- Strong knowledge of cloud platforms (AWS, GCP) and container orchestration (Kubernetes).
- Hands‑on expertise with monitoring/observability tools (Prometheus, Grafana, ELK, Datadog).
- Solid scripting skills (Python, Bash) and familiarity with IaC (Terraform, CloudFormation).
- Excellent problem‑solving, communication, and teamwork abilities.
Skills
Python, SQL, Linux, Jenkins, Prometheus, Grafana, PostgreSQL, MongoDB