We're looking for a Machine Learning Engineer to own and evolve our distributed training pipeline for large language models. You'll work inside our GPU cluster to help researchers train and scale foundation models using frameworks like Hugging Face Transformers, Accelerate, DeepSpeed, FSDP, and others. Your focus will be distributed training: from designing sharding strategies and multi-node orchestration to optimizing throughput and managing checkpoints at scale.
This role is not a research position; it's about building and scaling the systems that let researchers move fast and models grow big. You'll work closely with MLOps, infra, and model developers to make our training runs efficient, resilient, and reproducible.
What You'll Do:
Own the architecture and maintenance of our distributed training pipeline;
Train LLMs using tools like DeepSpeed, FSDP, and Hugging Face Accelerate (see the sketch after this list);
Design and debug multi-node/multi-GPU training runs (Kubernetes-based);
Optimize training performance: memory usage, speed, throughput, and cost;
Help manage experiment tracking, artifact storage, and resume logic;
Build reusable, scalable training templates for internal use;
Collaborate with researchers to bring their training scripts into production shape.
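To make the day-to-day concrete, here is a minimal sketch of the kind of Accelerate-based training loop with periodic checkpointing this role would own. It is illustrative only: the helpers build_model and build_dataloader, the checkpoint path, and the hyperparameters are hypothetical, not part of our codebase.

```python
# Minimal sketch: Hugging Face Accelerate training loop with checkpointing for resume.
# build_model() and build_dataloader() are hypothetical helpers; values are placeholders.
import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")   # device placement + distributed backend handled here

model = build_model()                                # hypothetical: returns a torch.nn.Module
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
train_loader = build_dataloader()                    # hypothetical: returns a torch.utils.data.DataLoader

# prepare() wraps everything for the configured backend (DDP, FSDP, or DeepSpeed)
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for step, batch in enumerate(train_loader):
    outputs = model(**batch)                         # assumes batches carry labels so the model returns a loss
    loss = outputs.loss
    accelerator.backward(loss)                       # handles gradient scaling / sharded backward passes
    optimizer.step()
    optimizer.zero_grad()

    if step > 0 and step % 1000 == 0:
        # Save model, optimizer, and RNG state so a preempted run can resume
        accelerator.save_state("checkpoints/latest")
```

On resume, accelerator.load_state("checkpoints/latest") restores the same objects; in production you would also skip already-consumed batches and reattach experiment tracking.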
What We're Looking For:
Expertise in distributed training: experience with DeepSpeed, FSDP, or Hugging Face Accelerate in real-world multi-GPU or multi-node setups (see the FSDP sketch after this list);
Strong PyTorch background: comfortable writing custom training loops, schedulers, or callbacks;
Hugging Face stack experience: Transformers, Datasets, Accelerate - you know the ecosystem and how to bend it;
Infra literacy: you understand how GPUs, containers, and job schedulers work together, and you can debug cluster issues, memory bottlenecks, or unexpected slowdowns;
Resilience mindset: you write code that can checkpoint, resume, log correctly, and keep running when things go wrong;
Collaborative builder: you don't mind digging into other people's scripts, making them robust, and helping everyone train faster.
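As a rough illustration of the distributed-training and PyTorch points above, here is a hedged sketch of sharding a model with PyTorch FSDP at the transformer-block level. build_model and MyTransformerBlock are hypothetical names, and the launcher and environment variables are assumptions about a typical multi-node setup.

```python
# Hedged sketch: wrapping a model with PyTorch FSDP (names are hypothetical).
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Assumes the launcher (torchrun, Slurm, or a Kubernetes operator) sets RANK,
# WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for each process.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_model()                                # hypothetical helper returning a torch.nn.Module

# Shard parameters, gradients, and optimizer state at the transformer-block boundary
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MyTransformerBlock},      # hypothetical: the model's repeated block class
)
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)
# From here the loop is ordinary PyTorch: forward, backward, optimizer.step().
```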
Bonus Points:
Experience with Kubernetes-based GPU clusters and Ray;
Experience with experiment tracking (MLflow, W&B);
Familiarity with mixed precision, ZeRO stages, and model parallelism (see the config sketch after this list);
Comfort with CLI tooling, profiling, logging, and telemetry;
Experience with data-loading bottlenecks and dataset streaming.
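To anchor the mixed precision and ZeRO bullets, here is an illustrative DeepSpeed configuration expressed as a Python dict. The key names follow DeepSpeed's public config schema; the specific values are placeholder assumptions, not production settings.

```python
# Illustrative DeepSpeed config: bf16 mixed precision + ZeRO stage 3 sharding.
# Values are placeholders; tune per model size, cluster, and budget.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "bf16": {"enabled": True},                       # bfloat16 mixed precision
    "zero_optimization": {
        "stage": 3,                                  # shard parameters, gradients, and optimizer state
        "overlap_comm": True,                        # overlap collectives with compute
        "stage3_gather_16bit_weights_on_model_save": True,  # consolidated weights at checkpoint time
    },
}
```

In practice a config like this is usually serialized to JSON and passed to the launcher (or to Accelerate's DeepSpeed integration) rather than hard-coded.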
How We Hire:
Online assessment: technical logic and fundamentals (Math/Calculus, Statistics, Probability, Machine Learning/Deep Learning, Code)
Technical interview: deep dive into distributed training theory and reasoning (no code)
Cultural interview
If you are not willing to take an online assessment, please do not apply.
If you've trained LLMs before - or helped others do it better - this role is for you. Even if you don't check every box, if you're confident working with distributed compute and real-world LLM workloads, we want to hear from you.