At Databricks, we are passionate about enabling data teams to solve the world's toughest problems, from making the next mode of transportation a reality to accelerating the development of medical breakthroughs. We do this by building and running the world's best data and AI infrastructure platform so our customers can use deep data insights to improve their business.
About the Team: Join the Core of Databricks Infrastructure
The Compute Infra Team is the engine behind all of Databricks' products and Control Plane services. We build and scale the foundational compute infrastructure that enables every Databricks customer to succeed, operating one of the largest and most dynamic data and AI clouds in the world.
Mission: Define the Future of Cloud Compute Efficiency and Scale
As the Technical Lead for Compute Fleet Management, you won't just manage a fleet; you will set the standard for how Databricks consumes and optimizes compute across all three major clouds (AWS, Azure, and GCP). This is a mission-critical role with direct impact on our gross margin and customer experience. Your mandate includes:
- Pioneering Fleet Optimization: Provision and pool cloud resources on the order of billions to achieve peak workload performance, industry-leading efficiency, and robust resource isolation.
- Delivering Hyper-Scale Resilience: Build the architecture that guarantees horizontal scaling and resilience against zonal or even cloud account-level failures, ensuring Databricks is always on.
- Owning the Critical Path: Lead the development of the lowest-dependency systems required to bootstrap and manage our massive compute platform.
Outcomes: The Impact You Will Deliver
This role is for an engineer who thrives on owning the most challenging and impactful outcomes:
- High Availability: Achieve and maintain 99.99% availability for all batch and serving workloads.
- Stellar Efficiency: Drive utilization to 60% or higher, a crucial metric that requires balancing high efficiency with unwavering tolerance for cloud failures.
- Best-in-Class Isolation: Architect and enforce strong security and performance isolation across a diverse range of customer workloads.
Requirements: Are You Ready for This Challenge?
We are seeking a seasoned Principal Engineer who has not only built but also successfully operated large-scale, mission-critical infrastructure systems in production. You must have a track record of:
- Leading Transformative Projects: Taking ownership of complex, cross-team, cross-layer, multi-quarter strategic engineering initiatives from concept to execution.
- Distributed Systems Mastery: Deep, hands-on experience developing and operating high-scale distributed systems on at least one major public cloud.
- Influence Without Authority: Proven ability to drive consensus, establish technical direction, and lead large technical efforts across organizational boundaries.
- Execution Discipline: Exceptional strength in planning, tracking project progress, and managing complex cross-organizational dependencies.
The Edge: Highly Desirable Experience
- Experience managing and scaling a massive fleet of GPUs for AI/ML workloads.
- Experience developing and operating large-scale distributed systems across all major clouds (AWS, Azure, and GCP).
Required Experience:
Staff IC