Machine Learning Engineer II, Amazon Music AI and Personalization
Seattle, WA - USA
Job Summary
Key job responsibilities
Model Training Optimization
- Design and implement strategies to improve training throughput and reduce time-to-convergence
- Profile and eliminate bottlenecks in data loading, preprocessing, and model computation
- Develop and maintain training infrastructure that scales efficiently with model and dataset size
Inference Optimization
- Optimize models for low-latency, high-throughput production inference
- Implement and benchmark inference optimizations across various hardware targets (GPU, CPU, edge devices)
- Establish performance benchmarks and monitoring for inference pipelines
Service Ownership & Operations
- Own production services that support ML decision models, including ranking services, orchestration layers, and model-serving infrastructure
- Participate in on-call rotation to ensure service reliability, respond to operational issues, and drive continuous improvement
- Design and implement monitoring, alerting, and observability solutions for ML services to proactively identify and resolve issues
- Manage service dependencies, API contracts, and integration points between ML models and downstream systems
- Drive operational excellence through automation, runbook development, and post-incident reviews
Cross-Functional Collaboration
- Partner with research teams to understand model architectures and identify optimization opportunities
- Collaborate with Science/ML teams on service integration points and ownership boundaries for ML components
- Contribute to best practices and tooling for ML efficiency across the organization
- Evaluate emerging hardware and software technologies for potential adoption
A day in the life
An MLE's day typically begins with checking model performance metrics and reviewing overnight training runs. Mornings often involve team standups and planning sessions. The core work includes cleaning and preprocessing data, developing and fine-tuning models, writing Python code (both by yourself and via GenAI coding tools), and debugging pipelines. Afternoons might feature collaboration with data scientists and software engineers, code reviews, and deploying models to production. Service ownership responsibilities include monitoring production systems, responding to alerts, participating in on-call rotations, and ensuring model reliability and performance in live environments. Time is also spent reading research papers and attending annual conferences to stay current on state-of-the-art model training and online inference optimization techniques.
- Bachelor's degree in computer science or equivalent
- 3 years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience
- Experience in machine learning, data mining, information retrieval, statistics, or natural language processing
- Experience programming with at least one modern language such as Java, C, or C#, including object-oriented design
- Experience with Machine Learning and Large Language Model fundamentals, including architecture, training/inference lifecycles, and optimization of model execution
- Experience building complex software systems that have been successfully delivered to customers
- Experience with Machine and Deep Learning toolkits such as MXNet, TensorFlow, Caffe, and PyTorch
- Experience in production monitoring and metrics reporting
- Experience building, deploying, and maintaining large-scale machine learning infrastructure using distributed data processing frameworks such as Spark or Ray
- Experience owning and operating production services, including on-call responsibilities, incident management, and operational metrics
- Master's degree in computer science or equivalent
- Expertise in large-model inference optimization, including techniques such as quantization, pruning, and distillation
- Demonstrated experience designing semantic search or RAG pipelines, integrating embeddings, vector stores, and generative models
- Proficiency in online and offline experimentation, evaluation frameworks, and metrics instrumentation for ML systems
- Experience with service-oriented architectures, microservices design patterns, and managing service dependencies in complex ML systems
- Strong collaboration and communication skills, with the ability to bridge science and engineering to deliver end-to-end ML solutions
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.
The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits, including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at
WA Seattle - 143,700.00 - 194,400.00 USD annually
Required Experience:
IC
About Company
Free shipping on millions of items. Get the best of Shopping and Entertainment with Prime. Enjoy low prices and great deals on the largest selection of everyday essentials and other products, including fashion, home, beauty, electronics, Alexa Devices, sporting goods, toys, automotive ...