Site Reliability Engineer
Job Description Summary
The Platform SRE at GE Vernova is responsible for provisioning, hardening, and maintaining EKS Kubernetes clusters in AWS, ensuring secure, scalable, and resilient infrastructure for global grid SaaS products through lifecycle management, performance tuning, and patching.
Job Description
Roles and Responsibilities
Day 0: Provision & Infrastructure Hardening
Kubernetes Cluster Orchestration: Help design and deploy hardened EKS clusters across multiple AWS regions, ensuring consistent security baselines.
Infrastructure as Code (IaC): Build and maintain reusable Terraform and Ansible modules for automated provisioning of cloud infrastructure services, including networking, compute, storage, queues, and caches.
Security Architecture: Implement "Policy as Code" guardrails and Electronic Security Perimeters (ESPs) in alignment with NERC CIP and IEC 62443 standards.
Operationalize Cloud Infrastructure: Standardize the runbooks and operating processes required to run critical infrastructure with the highest reliability.
Day 1: Platform Readiness & Scaling
Resource Governance: Define and enforce Kubernetes resource quotas, limit ranges, and pod priority classes to ensure mission-critical services receive prioritized compute resources.
Connectivity & Ingress: Manage the ingress strategy and service mesh architecture to facilitate secure, performant connectivity between distributed microservices.
Acceptance Testing: Lead platform-level smoke tests, load tests, and disaster recovery exercises to validate that the infrastructure can meet 99.99% uptime targets.
Sizing & Optimization: Partner with application teams to right-size containerized workloads, optimizing for both performance and cloud cost (FinOps).
Day 2: Operational Excellence & Tier 3 Support
L3 Escalation: Act as the highest technical escalation point for complex Kubernetes internals, troubleshooting issues such as failed pods, memory leaks, and network partitions.
Incident Response: Lead root cause analysis (RCA) for platform-level outages, implementing systemic fixes to prevent recurring failures.
Toil Elimination: Proactively identify and automate repetitive operational tasks, such as cluster upgrades and OS patching, to ensure the team spends at least 50% of its time on engineering improvements.
Observability Integration: Institutionalize platform monitoring using Prometheus and Grafana, creating dashboards that surface the "Golden Signals" of cluster health.
Technical Requirements
Kubernetes: 5 years of experience operating production-grade Kubernetes clusters at scale.
Orchestration & Observability Tools: Expert-level knowledge of multi-cluster management and performance tuning, plus experience implementing observability tools such as Prometheus/Grafana, Dynatrace, Splunk, or Datadog.
AWS Infrastructure: Deep hands-on experience with AWS core services (EKS, EC2, ALB, S3, RDS, MSK).
Automation Stack: Proficiency in Terraform, Ansible, and Python or Go for infrastructure automation and deployment tools like ArgoCD or
Posted June 21, 2026