Principal Staff Software Engineer, AI Training Platform

LinkedIn

Job Location:

Mountain View, CA - USA

Monthly Salary: Not Disclosed
Posted on: 4 hours ago
Vacancies: 1 Vacancy

Department:

Engineering

Job Summary

This role will be based in Mountain View, CA.

At LinkedIn, we trust each other to do our best work where it works best for us and our teams. This role offers hybrid work options, meaning you can work from home and commute to a LinkedIn office, depending on what's best for you and when your team needs to be together.

As part of LinkedIn's AI Platform group, the AI Training team is responsible for developing and maintaining highly available and scalable deep learning training solutions to power our rapidly growing AI use cases. The team is responsible for scaling LinkedIn's AI model training with hundreds of billions of parameters for all AI use cases, from recommendation models and large language models (Generative AI) to computer vision models. We optimize training performance across algorithms, AI frameworks, infrastructure software, and hardware to harness the power of our GPU fleet with thousands of the latest GPU cards. The team also works closely with the open source community and has many open source committers (TensorFlow, Horovod, Ray, Hadoop, etc.) on the team. Additionally, this team focuses on technologies like LLMs, GNNs, Incremental Learning, Online Learning, and advanced LLM Agents work for Training infrastructure.

As a Principal Staff Software Engineer on the AI Training Infra team, you will play a crucial role in leading and building the next-gen training infrastructure to power AI use cases. You will design and implement high-performance AI Training pipeline data I/O; work with open source teams to identify and resolve issues in popular libraries like Huggingface, Horovod, and PyTorch; debug and optimize deep learning training; and provide advanced support for internal AI teams in areas like model parallelism, data parallelism, ZeRO, automatic mixed precision, and kernel fusion. Finally, you will assist in and guide the development of containerized pipeline orchestration infrastructure, including developing and distributing stable base container images, providing advanced profiling and observability, and updating internally maintained versions of deep learning frameworks and their companion libraries like TensorFlow, PyTorch, DeepSpeed, GNNs, Flash Attention, and more.

Responsibilities

  • Owning the technical strategy for broad or complex requirements with insightful and forward-looking approaches that go beyond the direct team and solve large open-ended problems.
  • Designing, implementing, and optimizing the performance of large-scale distributed training for personalized recommendation as well as large language models.
  • Improving the observability and understandability of various systems with a focus on improving developer productivity and system sustenance.
  • Mentoring other engineers, defining our challenging technical culture, and helping to build a fast-growing team.
  • Working closely with the open-source community to participate in and influence cutting-edge open-source projects (e.g. PyTorch, GNNs, DeepSpeed, Huggingface, etc.).
  • Functioning as the tech-lead for several concurrent key initiatives for the Training Infrastructure and defining the future of AI training platforms.

Qualifications

Basic Qualifications

  • BS/BA in Computer Science or related technical field or equivalent technical experience
  • 7 years of industry experience in software design, development, and algorithm-related solutions
  • 7 years of experience programming in object-oriented languages such as Python, C, Java, Go, Rust, Scala
  • 5 years of experience as an architect or in a technical leadership position
  • 5 years of experience in the industry with leading / building deep learning systems
  • Hands-on experience developing distributed systems or other large-scale systems

Preferred Qualifications

  • MS or PhD in Computer Science or related technical discipline.
  • 12 years of experience in software design, development, and algorithm-related solutions, with at least 5 years of experience in a technical leadership position
  • 12 years of experience in an object-oriented programming language such as Python, C, Java, Go, Rust, Scala
  • 5 years of experience with large-scale distributed systems and client-server architectures
  • Co-author or maintainer of any open-source projects
  • Expertise in machine learning infrastructure, including technologies like MLFlow, Kubeflow, and large-scale distributed systems
  • Familiarity with containers and container orchestration systems
  • Expertise in deep learning frameworks and tensor libraries like PyTorch, TensorFlow, JAX/FLAX

Suggested Skills

  • ML Algorithm Development
  • Machine Learning / Deep Learning
  • Big Data
  • Stakeholder Management

LinkedIn is committed to fair and equitable compensation practices. The pay range for this role is $207,000 to $340,000. Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to skill set, depth of experience, certifications, and specific work location. This may be different in other locations due to differences in the cost of labor. The total compensation package for this position may also include annual performance bonus, stock, benefits, and/or other applicable incentive compensation plans. For more information, visit Information :

Equal Opportunity Statement 

We seek candidates with a wide range of perspectives and backgrounds, and we are proud to be an equal opportunity employer. LinkedIn considers qualified applicants without regard to race, color, religion, creed, gender, national origin, age, disability, veteran status, marital status, pregnancy, sex, gender expression or identity, sexual orientation, citizenship, or any other legally protected class.

LinkedIn is committed to offering an inclusive and accessible experience for all job seekers, including individuals with disabilities. Our goal is to foster an inclusive and accessible workplace where everyone has the opportunity to be successful.

If you need a reasonable accommodation to search for a job opening, apply for a position, or participate in the interview process, connect with us at and describe the specific accommodation requested for a disability-related limitation.

Reasonable accommodations are modifications or adjustments to the application or hiring process that would enable you to fully participate in that process. Examples of reasonable accommodations include, but are not limited to:

  • Documents in alternate formats or read aloud to you
  • Having interviews in an accessible location
  • Being accompanied by a service dog
  • Having a sign language interpreter present for the interview

A request for an accommodation will be responded to within three business days. However, non-disability related requests, such as following up on an application, will not receive a response.

LinkedIn will not discharge or in any other manner discriminate against employees or applicants because they have inquired about, discussed, or disclosed their own pay or the pay of another employee or applicant. However, employees who have access to the compensation information of other employees or applicants as a part of their essential job functions cannot disclose the pay of other employees or applicants to individuals who do not otherwise have access to compensation information, unless the disclosure is (a) in response to a formal complaint or charge, (b) in furtherance of an investigation, proceeding, hearing, or action, including an investigation conducted by LinkedIn, or (c) consistent with LinkedIn's legal duty to furnish information.

San Francisco Fair Chance Ordinance

Pursuant to the San Francisco Fair Chance Ordinance, LinkedIn will consider for employment qualified applicants with arrest and conviction records.

Pay Transparency Policy Statement

As a federal contractor, LinkedIn follows the Pay Transparency and non-discrimination provisions described at this link:

Data Privacy Notice for Job Candidates

Please follow this link to access the document that provides transparency around the way in which LinkedIn handles personal data of employees and job applicants:

Work :

No


Employment Type :

Full-time

Key Skills

  • Continuous Integration
  • Docker
  • Jenkins
  • Python
  • System Design
  • Agile
  • C/C++
  • Go
  • Systems Engineering
  • Software Development
  • Java
  • Distributed Systems

About Company

LinkedIn is the world’s largest professional network, built to create economic opportunity for every member of the global workforce. Our products help people make powerful connections, discover exciting opportunities, build necessary skills, and gain valuable insights every day.