Basic Qualifications
Bachelor's degree in Software Engineering or a related Science, Technology, Engineering, or Mathematics field, plus a minimum of 8 years of relevant experience; or Master's degree, plus 6 years of relevant experience. CLEARANCE REQUIREMENTS: Department of Defense Secret security clearance is required at time of hire. Applicants selected will be subject to a U.S. Government security investigation and must meet eligibility requirements for access to classified information. Due to the nature of work performed within our facilities, U.S. citizenship is required.
Responsibilities for this Position
What You'll Own
- SLOs and reliability metrics. Define service level objectives for every AI service that goes to production. Establish error budgets and use them to drive engineering decisions — not just measure uptime.
- Monitoring and observability. Build and maintain monitoring, logging, and alerting infrastructure for AI services. You will know when something is degrading before users do.
- Incident response. Establish incident management procedures, lead post-incident reviews, and drive corrective actions. When something breaks, you coordinate the response and ensure it doesn't break the same way again.
- Operational readiness reviews. Before any AI service goes live, you validate that it meets reliability, security, and operational standards. You are the gate between "it works in dev" and "it's ready for production."
- Capacity planning and cost monitoring. Track resource consumption, forecast capacity needs, and monitor costs — tokens, compute, storage. You ensure the platform scales without surprises.
- Toil elimination. Identify and automate repetitive operational tasks. If a human is doing something a script could do, you fix that.
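To make the error-budget idea in the first item concrete, here is a minimal Python sketch of how an SLO target might translate into a budget that gates releases. The class name `ErrorBudget`, its fields, and the `freeze_threshold` parameter are hypothetical and chosen for illustration; they do not describe any existing tooling on the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one service over one SLO window (illustrative)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def consumed_fraction(self) -> float:
        # Fraction of the budget already spent; values above 1.0 mean the SLO is breached.
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests == 0 else float("inf")
        return self.failed_requests / allowed

    def releases_allowed(self, freeze_threshold: float = 1.0) -> bool:
        # Assumed policy: pause risky changes once the budget is spent.
        return self.consumed_fraction < freeze_threshold


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget consumed:  {budget.consumed_fraction:.0%}")
    print(f"releases allowed: {budget.releases_allowed()}")
```

Treating the budget as a release gate rather than a dashboard number is what lets it drive engineering decisions: a team with budget remaining can ship faster, and a team that has spent it shifts effort to reliability work.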
What You Won't Own
- Application development or AI model building: you ensure that what others build is operable; you don't build it yourself
- Infrastructure provisioning — IT provides the infrastructure; you define what's needed and validate it works
- Business process decisions or backlog prioritization
What Makes This Role Different
- AI services have failure modes that traditional applications don't — model drift, token budget exhaustion, prompt injection, upstream data quality degradation. You will build monitoring for problems that most SRE teams have never encountered.
- You are applying SRE principles from scratch. There is no existing SRE practice to inherit — you will define it for the platform.
- Your operational readiness reviews directly determine whether AI services go live. You have real authority to say "not ready."
Required Qualifications
- Bachelor’s degree in Computer Science, Software Engineering, or a related field, plus 5 years of experience; or Mast