onsite

Senior Site Reliability Engineer (AI)

OneTrust is seeking a Senior Site Reliability Engineer (AI) to own production services end-to-end, ensuring reliability, scalability, and operational excellence. This role involves partnering with engineering, operations, and product teams to design, deliver, and maintain a highly available and performant application platform, with a focus on AI-assisted incident response and machine learning techniques for reliability.

About the role

Strength in Trust

OneTrust’s mission is to enable innovation through the responsible use of data and AI. We believe that ensuring data is trusted shouldn’t slow teams down—it should accelerate what’s possible. Today, with AI representing the latest and most impactful expansion of data yet, OneTrust is once again redefining what responsible innovation looks like. OneTrust, the AI‑Ready Governance Platform™, unifies regulatory intelligence, automation, and connected governance workflows so businesses can continue to move at the speed of AI while ensuring good governance to prevent data misuse at scale. Trusted by thousands of organizations worldwide, OneTrust is shaping the future where trusted data becomes a transformative force for business and society.

The Challenge

Own production services end-to-end, including reliability, scalability, and operational excellence
Participate in on-call rotation and lead incident response

Your Mission

Engage and partner with various Engineering, Operations, and Product teams to design, deliver, and maintain a highly available and performant application platform.

Collaborate with different functional groups to identify gaps, prioritize, and resolve issues
Define, implement, and maintain SLIs and SLOs aligned with customer experience
Design and instrument SLIs such as latency, error rates, and availability across critical services
Manage and enforce error budgets to balance system reliability with product feature velocity
Improve alert quality by reducing noise and focusing on actionable, high-signal alerts
Embed with product teams to review architectures and catch reliability risks early
Share your knowledge and experience with the Engineering organization
Share your findings with technical leadership and senior management
Build scripts in Python, Bash, Java, or Ruby for operational automation and incident response

You Are

A hands-on engineer who is familiar with running production services and can devise solutions to monitor and automate them appropriately.

Your Experience Includes

Bachelor's degree in Computer Science, Engineering, or a related technical or business field
4+ years of application development experience with Java or an equivalent language
Experience with the Spring environment
Experience in cloud-based infrastructure (Azure, AWS, GCP, etc.)
Experience with the factors that influence software application performance at multiple layers (database, network, CPU utilization, JVM tuning, memory analysis, thread management, query performance, etc.)
An understanding of the importance of centralized logging, metrics dashboards, and alerting, and the ability to discuss some of the tools used for these tasks
A good understanding of databases (ideally SQL/NoSQL)
Hands-on experience with observability tools (Datadog, Prometheus, Grafana, etc.)
Familiarity with CI/CD pipelines and infrastructure-as-code (Terraform, Helm, Jenkins, GitLab)
Experience building and operating AI-assisted incident response systems (root cause analysis, log summarization, anomaly triage)
Experience developing or integrating LLM-based tools to reduce MTTR and improve alert quality
Experience applying machine learning techniques for anomaly detection, capacity prediction, or failure pattern analysis
Experience deploying AI systems in production (not just experimentation)
Familiarity with vector databases, embeddings, or RAG architectures for operational intelligence
Strong understanding of prompt engineering and evaluation of LLM outputs in reliability workflows
Experience with Kubernetes and container orchestration (EKS/AKS/GKE)
Experience with distributed systems at scale
Familiarity with service meshes and microservices architecture

Nice to Have

Experience with chaos engineering tools (Gremlin, Chaos Monkey)
Background in product-facing services with high traffic scale
Knowledge of incident management platforms (PagerDuty, Datadog alerts)
