Staff Machine Learning Engineer - Infinia AI Performance

Job Location

Shanghai - China

Monthly Salary

Not Disclosed

Vacancy

1 Vacancy

Job Description

Overview

This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research, and manufacturing.

"DDN's A3I solutions are transforming the landscape of AI infrastructure." - IDC

"The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high performance environments" - Marc Hamilton, VP Solutions Architecture & Engineering, NVIDIA

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence.

Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management.

Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage.

Job Description

We are seeking a talented and experienced Sr ML Engineer to help us optimize training, inference, and Retrieval-Augmented Generation (RAG) pipelines for high-performance AI applications. You will lead the development of connectors to open-source frameworks for data streaming, such as Mosaic Streaming and Ray Data, and of inference optimizations such as K-V caching and LORAX. You will guide a talented organization of engineers focused on an advanced end-to-end data platform for ingestion, transformation, preparation, and streaming for high-performance AI applications. Collaborating closely with software developers, product teams, and partners, you will lead experiments with state-of-the-art models using open-source tools and cloud platforms.

Key Responsibilities:

  • Design and implement integration of data ingestion and streaming pipelines with open-source tools like Ray Data, Mosaic Streaming, and Torch Dataloader.
  • Design optimizations for training, such as asynchronous checkpointing, and for inference, such as K-V caching and LORAX.
  • Guide the integration of MLflow with DDN's Infinia product for comprehensive experiment tracking, model versioning, and deployment.
  • Drive the implementation and scaling of Retrieval-Augmented Generation (RAG) pipelines to enhance generative model performance.
  • Stay abreast of the latest developments in AIOps, AI frameworks, optimization, and accelerated execution.
  • Identify and implement solutions to optimize training and inference pipeline performance, runtime, and resource utilization on Infinia.

Qualifications:

  • Bachelor's or Master's degree in Computer Science, Data Science, Machine Learning, or related fields.
  • 4 years of experience in machine learning operations (MLOps) or related roles.
  • Proven expertise in building and scaling AI/ML pipelines.
  • Strong understanding of machine learning frameworks and libraries (TensorFlow, PyTorch, NVIDIA NeMo, vLLM, TensorRT-LLM).
  • Experience in deploying open-source vector databases at scale.
  • Solid understanding of cloud infrastructure (AWS, GCP, Azure) and distributed computing.
  • Proficiency with containerization tools (Docker, Kubernetes) and infrastructure as code.
  • Excellent problem-solving and troubleshooting skills, with attention to detail and performance optimization.
  • Strong communication and collaboration skills.

Preferred Qualifications:

  • Implementation-level understanding of ML frameworks, data loaders, and data formats.
  • Experience with scaling RAG pipelines and integrating them with generative AI models.
  • Experience in operationalizing AI/ML models in production environments.

Participation in a team on-call rotation providing seven-day-a-week, out-of-hours coverage, including after-hours and weekend support work when required.

DDN

Join our dynamic and driven team, where engineering excellence is at the heart of everything we do. We seek individuals who love to challenge themselves and are fueled by curiosity. Here, you'll have the opportunity to work across various areas of the company thanks to our flat organizational structure, which encourages hands-on involvement and direct contributions to our mission. Leadership is earned by those who take initiative and consistently deliver outstanding results, both in their work ethic and deliverables, making strong prioritization skills essential. Additionally, we value strong communication skills in all our engineers and researchers, as they are crucial for the success of our teams and the company as a whole.

Interview Process: After you submit your application, one of our recruiters will review your resume. If your application passes this stage, you will be invited to a 30-minute interview during which a member of our team will ask some basic questions. If you clear the interview, you will enter the main process, which can consist of up to four interviews in total:

  • Coding assessment: Often in a language of your choice.
  • Systems design: Translate high-level requirements into a scalable, fault-tolerant service (depending on role).
  • Real-time problem-solving: Demonstrate practical skills in a live problem-solving session.
  • Meet and greet with the wider team.

Our goal is to finish the main process in 2-3 weeks at most.

DataDirect Networks (DDN) is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity, gender expression, transgender, sex stereotyping, sexual orientation, national origin, disability, protected Veteran Status, or any other characteristic protected by applicable federal, state, or local law.


Required Experience:

Staff IC

Employment Type

Full-Time
