Remote
Senior Technology Operations & Reliability Manager - ClearCaptions, LLC
Systems Engineer
Lead the reliability and operations of a cloud‑native, real‑time captioning platform, driving automation, incident response, and performance monitoring using AWS, CI/CD pipelines, and SRE best practices.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure on AWS to support real‑time speech‑to‑text services.
- Lead incident response, root‑cause analysis, and post‑mortem processes to continuously improve system reliability.
- Build and maintain automated CI/CD deployment pipelines and configuration management to accelerate feature delivery.
- Implement monitoring, alerting, and observability solutions (e.g., Prometheus, Grafana, CloudWatch) to ensure service health and performance.
- Collaborate with engineering, product, and support teams to define SRE standards, SLAs, and error budgets.
Requirements
- 5+ years of experience in site reliability, DevOps, or cloud operations, preferably in a SaaS environment.
- Strong expertise with AWS services (EC2, RDS, Lambda, S3) and infrastructure‑as‑code tools (CloudFormation, Terraform).
- Proven track record of building automated CI/CD pipelines and implementing robust monitoring/alerting frameworks.
- Hands‑on experience with incident management, root‑cause analysis, and driving continuous improvement.
- Excellent communication skills and ability to lead cross‑functional teams in a fully remote setting.