Position Summary
We’re hiring a Senior MLOps Engineer with deep machine learning engineering experience to build and operate the production platform powering ML/LLM-driven healthcare workflows. You’ll design reliable, secure, and compliant systems for model development, evaluation, deployment, monitoring, and continuous improvement—working closely with ML, data, security, and product teams.
This role is ideal for someone who has shipped ML systems in production and is excited about LLM orchestration, RAG, evaluations, guardrails, and observability in a regulated environment.
Key responsibilities
MLOps & ML Platform
- Design and operate ML platforms that support end-to-end workflows: data ingestion, feature engineering, training, evaluation, deployment, and monitoring.
- Build and maintain CI/CD for ML (testing, packaging, versioning, reproducibility, automated rollbacks, approvals).
- Implement MLOps best practices: model registry, experiment tracking, lineage, governance, and reproducible training environments.
- Develop scalable training infrastructure (distributed training, GPU scheduling, cost controls, auto-scaling).
- Create and maintain feature pipelines / feature stores, ensuring consistency between training and inference (training-serving skew prevention).
- Establish model monitoring and observability: performance, drift, bias/fairness signals (where relevant), latency, throughput, and data quality.
- Build and own end-to-end LLM delivery pipelines: prompt versioning, retrieval, orchestration, evaluation, deployment, monitoring, and iterative improvement.
- Create robust LLM evaluation harnesses (offline + online): golden datasets, automated regression testing, human-in-the-loop review workflows, and risk scoring.
- Build cost controls: token/cost budgeting, caching strategies, autoscaling, and performance tuning.
Deployment, reliability, and operations
- Productionize ML models on GCP using containers and orchestration (e.g., GKE, Cloud Run), and build CI/CD for ML/LLM systems with automated tests and safe rollouts.
- Implement observability: tracing, metrics, logs, dashboards, and alerting for model and system health (latency, token usage, error rates, retrieval quality, hallucination indicators, and drift where relevant).
Data, governance, and compliance (Healthcare)
- Design systems with security and privacy by default: IAM, least privilege, secrets management, audit logs, encryption, data retention, and PHI/PII handling.
- Implement governance: model/prompt lineage, dataset provenance, evaluation traceability, and approval workflows aligned with healthcare compliance expectations.
- Integrate guardrails: cont