remote

Director of Cloud Operations - Firstup

Engineering Manager

Director of Cloud Operations leading scalable, secure cloud infrastructure for a global workforce platform, driving automation, CI/CD pipelines, and monitoring across AWS and Kubernetes environments to ensure high availability and performance.

About the role

Who We Are

At Firstup, our mission is to improve the employee experience at every moment that matters, large and small. As the communication pipeline for the world's workforce, we now serve 40 of the Fortune 100 companies, reaching and connecting more than 17 million employees daily.

Our employees are experts in the employee experience, workforce communications and technology.

Joining Firstup means joining a movement to make work better for every worker. As the world’s first intelligent communication platform, Firstup meaningfully engages employees at every moment from hire to retire, and delivers engagement insights to help companies support, promote and retain their talent. Our movement has taken root and is evident in our world-class customer base. Now we need your help. Ready to make a difference in the world?

Job Summary

We are seeking a Director of Cloud Operations (CloudOps) to lead and evolve our cloud infrastructure and operational practices across a globally distributed SaaS platform. This is a hands-on leadership role responsible for ensuring the reliability, scalability, and efficiency of our systems running across multiple AWS regions in the United States and Europe.

As part of the senior leadership team, you will partner closely with Engineering, Security, and Product to strengthen operational excellence, enhance system observability, and drive continuous improvement in how we build and run services. You will lead a distributed team of engineers across the US and UK, fostering a high-performing, collaborative, and growth-oriented environment.

This role is ideal for a leader who combines deep technical expertise with a pragmatic approach to improving systems, processes, and team capabilities.

What You’ll Do

Cloud Platform & Reliability

Own the availability, performance, and resilience of our multi-region AWS platform.

Drive improvements in system reliability through well-defined SLIs/SLOs, error budgets, and proactive engineering practices.

Lead efforts to reduce MTTR and improve incident response effectiveness across the organization.

Guide architecture decisions for microservices, Kubernetes (EKS), and serverless workloads to ensure scalability and fault tolerance.

Observability & Incident Management

Advance our observability strategy using Datadog, ensuring actionable insights across infrastructure and applications.

Establish and refine incident management practices, including on-call processes, escalation paths, and post-incident reviews.

Act as an incident commander for critical events and contribute to the on-call rotation.

Operational Excellence & Efficiency

Elevate operational standards through automation, standardization, and adoption of modern best practices.

Drive cost optimization initiatives acro

Skills