DevOps Engineer
Platform Engineer responsible for operating and scaling production infrastructure across multiple GKE clusters, ensuring high availability, autoscaling, and full observability for real-time video and AI agent workloads.
About Us
100ms operates two product lines at scale: a real-time Live Video platform powering latency-sensitive, high-concurrency video experiences, and an AI Agents platform that automates complex patient access workflows in U.S. healthcare.
Both products run on a shared, robust infrastructure foundation. You'll be joining the central platform team responsible for keeping both running reliably, securely, and at scale — serving developers and healthcare operators who depend on us around the clock.
What Will You Do
Manage GitOps workflows using Argo CD for automated, version-controlled, and auditable deployments across both product lines.
Maintain and optimize monitoring and alerting stacks built on open-source tools, with product-specific SLOs for low-latency video (jitter, packet loss, stream health) and AI workflow reliability (task throughput, failure rates, retry queues).
Implement infrastructure as code using Terraform for GCP resources and Helm charts for Kubernetes manifests, with a strong bias toward repeatability and auditability.
Support the unique infrastructure demands of real-time video — including media server scaling, WebRTC infrastructure, low-latency networking, and high-throughput data paths.
Support AI agent workloads — including LLM inference infrastructure, async task queues, and integration pipelines with external healthcare systems.
Lead or support incident response, cluster upgrades, and disaster recovery procedures across both platforms.
Own the security posture of our infrastructure — enforce least-privilege access controls, manage secrets hygiene, and drive security hardening across clusters and services.
Implement and maintain compliance-aligned controls relevant to healthcare data environments (e.g., encryption at rest/in transit, audit logging, network segmentation).
Collaborate with product and engineering teams to embed security early in the development lifecycle — shift-left on vulnerability scanning, dependency audits, and policy enforcement.
Who Can Apply
Computer Science / Engineering degree or equivalent practical experience.
Minimum 3 years of hands-on experience with Kubernetes in a production environment.
Strong knowledge of CI/CD pipelines and GitOps workflows using Argo CD or similar tools.
Proficient in infrastructure automation using Terraform and Helm.
Experience managing open-source monitoring and logging stacks (Prometheus, Loki, Grafana, Alertmanager, etc.).
Working knowledge of cloud security principles — IAM, network policies, pod security, RBAC, and secrets management.
Comfortable with Linux systems, shell scri
Posted June 21, 2026