onsite
Systems Engineer, HPC US & Canada - Mistral AI
About the role
Lead the design, deployment, and optimization of large‑scale HPC infrastructure across cloud and on‑prem environments, ensuring high availability, performance, and scalability for AI workloads.
Key Responsibilities
- Architect and maintain petabyte‑scale HPC clusters, integrating cloud services and on‑prem resources to support AI research and production workloads.
- Develop and automate deployment pipelines using Python, Bash, and configuration management tools for rapid provisioning and scaling.
- Collaborate with software and research teams to optimize performance, troubleshoot bottlenecks, and implement best practices for distributed training and inference.
- Monitor system health, capacity, and security, implementing proactive measures to ensure reliability and compliance.
- Document infrastructure designs, operational procedures, and knowledge base articles for internal use.
Requirements
- 5+ years of experience in HPC or large‑scale distributed systems engineering.
- Proficiency with Linux system administration, Python scripting, and C++ performance tuning.
- Hands‑on experience with cloud platforms (AWS, GCP, Azure), container orchestration (Kubernetes), and HPC workload scheduling (Slurm).
- Strong understanding of networking, storage, and security in high‑performance environments.
- Excellent problem‑solving skills and a collaborative mindset.