remote
Artificial Intelligence (AI) Infrastructure Architect - Lawrence Livermore National Laboratory
Software Engineer
About the role
Lead the design and deployment of scalable AI infrastructure, leveraging Kubernetes, Docker, and cloud services to support advanced machine learning workloads across high‑performance computing environments.
Key Responsibilities
- Architect and implement end‑to‑end AI infrastructure solutions, integrating container orchestration (Kubernetes) and CI/CD pipelines for rapid model deployment.
- Collaborate with data scientists and software engineers to optimize ML workflows, ensuring high availability, performance, and security across distributed systems.
- Design and maintain scalable cloud and on‑premises environments (AWS, HPC clusters), including networking, storage, and compute resource provisioning.
- Develop automation scripts (Python) for infrastructure provisioning, monitoring, and cost management.
- Lead troubleshooting and root‑cause analysis for production AI workloads, implementing best practices for resilience and observability.
Requirements
- 5+ years of experience in AI/ML infrastructure engineering or related roles.
- Strong knowledge of cloud platforms (AWS preferred) and high‑performance computing environments.
- Experience with CI/CD tools (GitLab CI, Jenkins) and infrastructure as code (Terraform, Ansible).
- Excellent problem‑solving skills and ability to work collaboratively in a multidisciplinary team.
Skills
Python, Kubernetes, Docker, Machine Learning, AWS