onsite

Senior Software Engineer - AI/ML, AWS Neuron Distributed Training - Amazon

Software Engineer

Lead the design and implementation of distributed AI/ML workloads on AWS Neuron, optimizing performance across Trainium and Inferentia hardware to accelerate customer solutions.

About the role

Key Responsibilities

Architect and develop scalable distributed training pipelines using AWS Neuron on Trainium (Trn1/Trn2), along with inference deployments on Inferentia (Inf1/Inf2) hardware.
Collaborate with cross‑functional teams to integrate custom silicon accelerators into cloud‑native ML workflows.
Optimize model throughput, memory usage, and inference latency through code‑level and hardware‑aware tuning.
Implement monitoring, profiling, and debugging tools to ensure reliability and efficiency at scale.
Contribute to open‑source Neuron SDK enhancements and provide technical guidance to internal stakeholders.

Requirements

5+ years of software engineering experience in AI/ML, with a strong focus on distributed training.
Proficiency in Python and experience with deep learning frameworks (PyTorch, TensorFlow).
Hands‑on experience with AWS services (SageMaker, EC2, EKS) and familiarity with the AWS Neuron SDK.
Solid understanding of distributed systems, parallel computing, and performance optimization.
Excellent problem‑solving skills and a passion for pushing the boundaries of ML hardware acceleration.

Skills

Python, AWS, machine learning

Company: Amazon

Department: Engineering

Location: Seattle, United States

Experience: 5+ years

Tenure: full-time

Level: Senior

Posted June 23, 2026