onsite
Senior Software Engineer - AI/ML, AWS Neuron Distributed Training - Amazon
Software Engineer
Lead the design and implementation of distributed AI/ML workloads on AWS Neuron, optimizing performance across Trainium and Inferentia hardware to accelerate customer solutions.
About the role
Key Responsibilities
- Architect and develop scalable distributed training pipelines on Trainium (Trn1/Trn2) and inference deployments on Inferentia (Inf1/Inf2) using the AWS Neuron SDK.
- Collaborate with cross‑functional teams to integrate custom silicon accelerators into cloud‑native ML workflows.
- Optimize model performance, memory usage, and inference latency through code and hardware tuning.
- Implement monitoring, profiling, and debugging tools to ensure reliability and efficiency at scale.
- Contribute to open‑source Neuron SDK enhancements and provide technical guidance to internal stakeholders.
Requirements
- 5+ years of software engineering experience in AI/ML, with a strong focus on distributed training.
- Proficiency in Python and experience with deep learning frameworks (PyTorch, TensorFlow).
- Hands‑on experience with AWS services (SageMaker, EC2, EKS) and familiarity with the AWS Neuron SDK.
- Solid understanding of distributed systems, parallel computing, and performance optimization.
- Excellent problem‑solving skills and a passion for pushing the boundaries of ML hardware acceleration.
Skills
python, aws, machine learning