onsite
Datacenter Hardware Operations Technician Lead, Industrial Compute - OpenAI
Systems Engineer
Lead a team of hardware technicians to maintain and optimize large‑scale industrial compute datacenters, ensuring uptime, reliability, and efficient lifecycle management using advanced monitoring, networking, and power‑cooling expertise.
About the role
Key Responsibilities
- Lead, mentor, and schedule a team of hardware technicians across multiple industrial compute sites.
- Oversee installation, configuration, and decommissioning of servers, storage, networking, and power/cooling infrastructure.
- Diagnose and resolve hardware failures, performance bottlenecks, and environmental issues using diagnostic tools and root‑cause analysis.
- Collaborate with Data Center Operations to align capacity planning, asset tracking, and preventive maintenance schedules.
- Implement and maintain automated monitoring, alerting, and reporting workflows to ensure proactive incident response.
- Document procedures, create knowledge base articles, and conduct training sessions for new hires and cross‑functional teams.
Requirements
- 5+ years of hands‑on datacenter hardware operations experience, preferably in large‑scale industrial compute environments.
- Strong knowledge of server, storage, networking, power, and cooling systems, with proven troubleshooting skills.
- Experience with configuration management (e.g., Ansible, Puppet) and monitoring tools (e.g., Prometheus, Grafana).
- Excellent communication, leadership, and documentation abilities.
- Ability to work flexible hours, including on‑call rotations, to support 24/7 operations.
Skills
process improvementproject managementoperations management