onsite

Software Engineer, Systems ML - Meta

Lead the design and implementation of scalable AI infrastructure, blending cutting‑edge machine learning research with systems engineering to deliver high‑performance, production‑grade solutions across Meta’s product ecosystem.

About the role

Key Responsibilities

Architect and develop distributed training pipelines that optimize resource utilization and reduce training time for large‑scale models.
Design and implement model serving frameworks capable of handling millions of requests per second with low latency.
Collaborate with research scientists to translate novel ML algorithms into production‑ready systems, ensuring reproducibility and robustness.
Conduct performance profiling and bottleneck analysis, applying GPU programming (CUDA) and hardware‑software co‑design techniques to accelerate inference and training workloads.
Integrate and maintain cloud‑native infrastructure (e.g., Kubernetes, AWS) to support elastic scaling and fault tolerance.

Requirements

Strong proficiency in Python and C++ with experience in large‑scale distributed systems.
Hands‑on experience with GPU programming (CUDA) and deep learning frameworks (PyTorch, TensorFlow).
Deep understanding of distributed computing concepts, including data parallelism, model parallelism, and fault‑tolerant design.
Experience building production‑grade ML serving systems and optimizing for latency and throughput.
Excellent problem‑solving skills and a passion for bridging research and engineering.

Skills

python, c, cuda, machine learning

Company: Meta

Department: Engineering

Location: Bellevue, WA, United States

Experience: 3+ years

Tenure: full-time

Level: Mid-Level

Salary: 257,000

Posted June 19, 2026