Staff Software Engineer - AI Research Infrastructure

Databricks

Job Location:

New York City, NY - USA

Monthly Salary: Not Disclosed
Posted on: 18 hours ago
Vacancies: 1

Job Summary

Staff Software Engineer - AI Research Infrastructure

P-1215

At Databricks, we are obsessed with enabling data teams to solve the world's toughest problems, from security threat detection to cancer drug development. We do this by building and running the world's best data and AI platform so our customers can focus on the high-value challenges that are central to their own missions.

The Databricks AI Research organization enables companies to develop AI models and agents using their own data, with technologies ranging from post-training open-source LLMs to developing advanced multi-agent architectures. Databricks AI does so by producing novel science and putting it into production. Databricks AI is committed to the belief that a company's AI models and agents are just as valuable as any other core IP, and that high-quality AI should be available to all.

Job Description

As a Staff Software Engineer, AI Research Infrastructure, you will develop and run the research stack that powers Databricks AI Research. You will design and build services that schedule, orchestrate, and observe large-scale training and inference experiment workloads across thousands of GPUs, improve our dev tooling, and ensure that researchers can iterate quickly without sacrificing reliability, efficiency, or security.

You'll partner closely with research scientists, ML engineers, and platform teams to turn experimental workloads into robust, repeatable pipelines and to push the limits of what our infrastructure can support.

The Impact you will have

As a Staff Software Engineer on the AI Research Infra Team at Databricks, you will:

  • Design and implement infrastructure that supports large-scale experiments, data processing, and model training (e.g. HPC clusters, GPU fleets, or cloud-based systems).
  • Enable researchers to go from idea to large-scale experiment in minutes, not days, by building powerful abstractions for job submission, scheduling, and monitoring.
  • Create tooling that improves research developer productivity, such as experiment management systems, CI/testing infrastructure for research code, and workflows that reduce iteration time.
  • Influence the long-term roadmap for research computation, shaping how Databricks AI Research trains, evaluates, and ships models to customers.
  • Serve as a technical mentor and force multiplier for other engineers working on compute infra and AI systems.

What We Look for

  • BS/MS or PhD in Computer Science or a related field.
  • 5 years of software engineering experience, including substantial time working on large-scale distributed systems or infrastructure.
  • Have deep experience with building and operating distributed systems, data pipelines, or large-scale backend services, ideally involving GPUs, clusters, or major cloud providers.
  • Are proficient in one or more systems programming languages (e.g. C, Rust, Go, Java, Scala) and can design, implement, and debug complex services.
  • Have built or significantly contributed to cluster schedulers, resource managers, or large-scale job orchestration systems (e.g. Kubernetes, Slurm, Ray, custom internal systems).
  • Understand modern ML training and inference workflows (e.g. distributed training, model parallelism, fine-tuning, evaluation), even if you're not primarily a research scientist.
  • Can move fast and be pragmatic in getting things done while caring about operational excellence. Have driven complex systems from prototype to stable, well-owned services.
  • Communicate clearly with both researchers and engineers, and enjoy translating between research needs and infra realities.

Required Experience:

Staff IC

About Company

The Databricks Platform is the world’s first data intelligence platform powered by generative AI. Infuse AI into every facet of your business.