Remote
Senior Solutions Architect - AI Factory Observability & Visualization - NVIDIA
Solutions Architect
Lead end‑to‑end observability for high‑performance computing and AI factories, turning complex telemetry into actionable insights using Python, Node.js, Prometheus, Grafana, and Kubernetes.
About the role
Key Responsibilities
- Design and implement comprehensive observability solutions for HPC and AI factory environments, integrating telemetry from network, compute, and storage layers.
- Develop and run microbenchmarks and AI workloads to validate system readiness and establish performance baselines.
- Collaborate with cross‑functional teams to translate raw data into intuitive dashboards and alerts that enable proactive issue detection.
- Architect scalable data pipelines using Python and Node.js, leveraging Prometheus for metrics collection and Grafana for visualization.
- Ensure observability solutions are production‑ready, secure, and compliant with industry best practices.
Requirements
- 5+ years of experience in observability, monitoring, or performance engineering for HPC or AI systems.
- Proficiency in Python, Node.js, and container orchestration with Kubernetes.
- Hands‑on experience with Prometheus, Grafana, and related telemetry tooling.
- Strong analytical skills and the ability to translate complex telemetry into actionable insights.
- Excellent communication skills and a collaborative mindset.
Skills
Python, Node.js, Prometheus, Grafana, Kubernetes