onsite
Software Engineer, Systems ML - Meta
About the role
Lead the design and implementation of scalable AI infrastructure, blending cutting‑edge machine learning research with systems engineering to deliver high‑performance, production‑grade solutions across Meta’s product ecosystem.
Key Responsibilities
- Architect and develop distributed training pipelines that optimize resource utilization and reduce training time for large‑scale models.
- Design and implement model serving frameworks capable of handling millions of requests per second with low latency.
- Collaborate with research scientists to translate novel ML algorithms into production‑ready systems, ensuring reproducibility and robustness.
- Conduct performance profiling and bottleneck analysis, applying GPU programming (CUDA) and hardware‑software co‑design techniques to accelerate inference and training workloads.
- Integrate and maintain cloud‑native infrastructure (e.g., Kubernetes, AWS) to support elastic scaling and fault tolerance.
Requirements
- Strong proficiency in Python and C++, plus experience building large‑scale distributed systems.
- Hands‑on experience with GPU programming (CUDA) and deep learning frameworks (PyTorch, TensorFlow).
- Deep understanding of distributed computing concepts, including data parallelism, model parallelism, and fault‑tolerant design.
- Experience building production‑grade ML serving systems and optimizing for latency and throughput.
- Excellent problem‑solving skills and a passion for bridging research and engineering.
Skills
python, c, cuda, machine learning