MLOps Lead, Central Technology

Job Location

Redwood City - USA

Salary

$241,000 - $331,000

Vacancy

1 Vacancy

Job Description

The Chan Zuckerberg Initiative was founded by Priscilla Chan and Mark Zuckerberg in 2015 to help solve some of society's toughest challenges, from eradicating disease and improving education to addressing the needs of our local communities. Our mission is to build a more inclusive, just, and healthy future for everyone.

The Team

Founded by Priscilla Chan and Mark Zuckerberg in 2015, the Chan Zuckerberg Initiative (CZI) is a new kind of philanthropy that's leveraging technology to help solve some of the world's toughest challenges, from eradicating disease to improving education to reforming the criminal justice system. Our mission is to create a future for everyone. Across our core Initiative focus areas of Science and Education, we're pairing engineering with grantmaking, impact investing, policy work, and movement building to help build an inclusive, just, and healthy future for everyone.

Our Values

  • We believe we can help build a future for everyone.
  • We aim to be daring but humble: We look for bold ideas, regardless of structure and stage, and help them scale by pairing engineers with subject matter experts to build tools that accelerate the pace of social progress.
  • We want to learn fast but build for the long term: We want to iterate fast and help bring new solutions to the table, but we also realize that important breakthroughs often take decades or even centuries.
  • Stay close to the real problems: We engage directly in the communities we serve because no one understands our society's challenges like those who live them every day.

Our success depends on building teams that include people from different backgrounds and experiences who can challenge each other's assumptions with fresh perspectives. To that end, we look for a diverse pool of applicants, including those from historically marginalized groups: women; people with disabilities; people of color; formerly incarcerated people; people who are lesbian, gay, bisexual, transgender, and/or gender nonconforming; first and second generation immigrants; veterans; and people from different socioeconomic backgrounds.

The Opportunity

Our Central Tech team provides technology and security support for CZI, the Biohub Network, and our grantees. We believe that Engineering and Security are most effective when in sync and learning from each other on a daily basis. Our AI Infrastructure Engineering team enables our AI Research teams to achieve their goals faster and more securely. We leverage technology to automate manual processes, constantly innovate to optimize operations, provide first-class support, and build solutions to enable the scale and execution of our business partners' strategies and initiatives.

The AI/ML and Data Engineering Infrastructure organization builds shared tools and platforms used across all of the Chan Zuckerberg Initiative, partnering with and supporting the work of a wide range of Research Scientists, Data Scientists, and AI Research Scientists, as well as a broad range of Engineers focusing on Education and Science domain problems. Members of the shared infrastructure engineering team have an impact on all of CZI's initiatives by enabling the technology solutions used by other engineering teams at CZI to scale. A person in this role will build these technology solutions and help cultivate a culture of shared best practices and knowledge around core engineering.

What You'll Do

  • Provide technical MLOps leadership for a team of MLOps Engineers, managing and leading the team in operating our heterogeneous AI training and inference systems and collaborating on the design and build of our AI platform components.
  • Drive the application of MLOps and DevOps principles across our multiple platforms, ensuring peak operational efficiency across our AI operations and the process automation necessary for a world-class, large-scale AI model training environment.
  • Provide technical leadership in instrumentation and observability for the MLOps team, defining our end-to-end metrics program, including full proactive monitoring and alerting systems.
  • Facilitate model training through collaboration with our AI Researchers: alongside the rest of the AI Infrastructure Eng team, make sure that the models we train and release to inference follow machine learning and deep learning best practices and, through code automation libraries, are fully resilient to restarts and checkpoint recoveries.
  • Continuously optimize our Kubernetes-based AI Lifecycle platform through our IaC-based practices, integrating our MLOps AI Lifecycle platform tooling with our On-Prem HPC systems into a cohesive heterogeneous platform.
  • Collaborate on data systems for our AI model training with our Data Infrastructure Eng team and the Science data teams, covering the end-to-end data usage that drives our AI model training.
  • Lead our MLOps team in supporting our on-call rotation, combining automation and proactive alerting to reduce on-call load and improve self-healing AI system operations. On-call volume will be low, but we do have 24/7 coverage, which will include members of the rest of the AI team for escalation and on-call coverage.

What You'll Bring

  • BS, MS, or PhD degree in Computer Science or a related technical discipline, or equivalent experience
  • 7 years of relevant coding and systems experience
  • 5 years of systems Architecture and Design experience with a broad range of MLOps experience across Data Infrastructure and AI/ML platforms
  • Proven technical leadership in SRE and MLOps, as well as either direct or indirect people management experience
  • Proven SRE and MLOps knowledge and related experience
  • Strong experience scaling containerized applications on Kubernetes or Mesos, including expertise with creating custom containers using secure AMIs and continuous deployment systems that integrate with Kubernetes or Mesos (Kubernetes preferred)
  • Cloud platform proficiency with Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure, and experience with On-Prem and Colocation Service hosting environments
  • MLOps experience working with medium to large scale GPU clusters in Kubernetes (Kubeflow), HPC environments, or large-scale cloud-based ML deployments
  • Working knowledge of NVIDIA CUDA and AI/ML custom libraries
  • Knowledge of Linux systems optimization and administration
  • Solid Coding experience
  • Proven coding ability with a systems language such as Rust, C/C++, C#, Go, Java, or Scala
  • Expertise with a scripting language such as Python (preferred), PHP, or Ruby
  • Experience in integrating Data with the AI Lifecycle
  • AI/ML platform operations experience in an environment with demanding data and systems platform challenges
  • Large-scale streaming data systems integration experience
  • Hadoop, Spark, and/or Kafka deployments, or their counterparts such as Pulsar, Flink, and/or Ray
  • Workflow scheduling tools such as Apache Airflow, Dagster, or Apache Beam
  • Understanding of Data Engineering, Data Governance, Data Infrastructure, and AI/ML execution platforms
  • PyTorch, Keras, or TensorFlow experience is a strong nice-to-have
  • HPC and Slurm experience is a strong nice-to-have

Compensation

The Redwood City, CA base pay range for this role is $241,000 - $331,000. New hires are typically hired into the lower portion of the range, enabling employee growth in the range over time. Actual placement in range is based on job-related skills and experience, as evaluated throughout the interview process.

Work Mode

As we grow, we're excited to strengthen in-person connections and cultivate a collaborative, team-oriented environment. This role is a hybrid position requiring you to be onsite for at least 60% of the working month, approximately 3 days a week, with specific in-office days determined by the team's manager. The exact schedule will be at the hiring manager's discretion and communicated during the interview process.

Benefits for the Whole You

We're thankful to have an incredible team behind our work. To honor their commitment, we offer a wide range of benefits to support the people who make all we do possible.

  • CZI provides a generous employer match on employee 401(k) contributions to support planning for the future.
  • Annual benefit for employees that can be used in whatever way is most meaningful for them and their families, such as housing, student loan repayment, childcare, commuter costs, or other life needs.
  • CZI Life of Service Gifts are awarded to employees to live the mission and support the causes closest to them.
  • Paid time off to volunteer at an organization of your choice.
  • Funding for select family-forming benefits.
  • Relocation support for employees who need assistance moving to the Bay Area.
  • And more!

If you're interested in a role but your previous experience doesn't perfectly align with each qualification in the job description, we still encourage you to apply, as you may be the perfect fit for this or another role.

Explore our work modes, benefits, and interview process at

Employment Type

Full Time
