remote
Kubernetes Software Engineer - Cadre5
We are seeking a senior engineer to design, deploy, and maintain Kubernetes-based HPC workloads on the OLCF Slate Service, ensuring high availability, performance, and security for critical scientific applications.
About the role
Key Responsibilities
- Design, implement, and manage Kubernetes clusters (RKE2) for high‑performance computing workloads on the OLCF Slate Service.
- Collaborate with HPC platform teams to integrate scientific applications, ensuring optimal resource utilization and scalability.
- Develop automation scripts and CI/CD pipelines to streamline deployment, monitoring, and troubleshooting of containerized services.
- Implement security best practices, including role‑based access control, network policies, and compliance with institutional standards.
- Provide technical support and performance tuning for production workloads, working closely with users and system administrators.
Requirements
- 5+ years of experience with Kubernetes and container orchestration in a production environment.
- Hands‑on expertise with RKE2, Helm, and related tooling.
- Strong background in HPC concepts, workload scheduling, and performance optimization.
- Proficiency in scripting (Python, Bash) and automation frameworks.
- Excellent problem‑solving skills and the ability to work collaboratively in a multidisciplinary team.