Onsite
AI Operations & Infrastructure Engineer - Invictus International Consulting, LLC
DevOps Engineer
Lead the design, deployment, and optimization of AI computing platforms, managing GPU hardware, container orchestration, high‑speed networking, and storage to deliver scalable, high‑performance AI workloads.
About the role
Key Responsibilities
- Design, deploy, and maintain AI computing platforms, including GPU clusters and specialized hardware accelerators.
- Install, configure, and update GPU drivers, CUDA toolkits, and related software stacks.
- Implement containerization with Docker and orchestrate workloads using Kubernetes for scalable AI services.
- Configure and optimize high‑speed networking (InfiniBand, Ethernet) to support low‑latency AI data pipelines.
- Manage storage solutions, balancing performance, capacity, and data durability for large AI datasets.
- Deploy and monitor data processing units (DPUs) to accelerate data center workloads and improve throughput.
Requirements
- Proven experience managing GPU‑based AI infrastructure and containerized environments.
- Strong knowledge of Kubernetes, Docker, and CI/CD pipelines for AI workloads.
- Hands‑on expertise with InfiniBand, Ethernet, and high‑performance networking protocols.
- Experience with storage technologies (NVMe, SSD, HDD) and performance tuning for AI data.
- Excellent troubleshooting skills and the ability to optimize complex, distributed systems.