P-1485
At Databricks, we are passionate about empowering data teams to tackle the world's most challenging problems, from bringing the next mode of transportation to reality to accelerating the development of medical breakthroughs. We achieve this by building and operating the world's best data and AI infrastructure platform, enabling our customers to leverage deep data insights and enhance their business. Founded by engineers and customer-obsessed, we leap at every opportunity to tackle technical challenges, from designing next-gen UI/UX for interfacing with data to scaling our services and infrastructure across millions of virtual machines. And we're only getting started.
As a Sr. SRE (Incident Manager), you will use your technical experience and resourcefulness to lead urgent customer situations to conclusion. You will be responsible for managing frequent, high-quality updates to all internal and external stakeholders. You will advocate with engineering and leadership on behalf of your customers and will ensure that escalations are handled with the appropriate level of urgency from stakeholders.
This role combines operational leadership, technical systems knowledge, and exceptional communication skills. You will be at the intersection of engineering depth and operational clarity, ensuring that every major incident is managed with precision, transparency, and continuous improvement.
The impact:
- Drive critical customer escalations or widespread outages to resolution. Escalate to on-call resources in support and engineering, and establish checkpoint calls and action items to ensure that progress is made and status updates are delivered on time.
- Demonstrate cross-functional leadership while establishing ownership of escalations and outages.
- Compile and deliver frequent, high-quality communication to internal and external stakeholders, including executive staff. The candidate should be comfortable creating concise, effective messaging tailored to a technical or executive audience with minimal assistance from others.
- Commence and lead war rooms while establishing other temporary communication channels as warranted for the duration of an outage.
- Multitask across several incidents and/or projects at once.
- Be the leader who derives product and process improvements from every incident and submits necessary feedback for improvements.
- Participate in on-call rotations.
What we are looking for:
- Minimum 5 years of experience in customer support, support escalation, and incident management is required.
- Minimum 5 years of experience in designing, testing, or maintaining Python/Java/Scala-based applications in typical project delivery and consulting environments is required.
- Prior incident management or escalation management experience is required.
- Hands-on experience developing production-scale industry use cases in any two or more of the following: Big Data, Hadoop, Spark, Machine Learning, Artificial Intelligence, Streaming, Kafka, Data Science, and Elasticsearch.
- Hands-on experience in the performance tuning/troubleshooting of Spark-based applications at production scale.
- Working knowledge of data lakes, preferably including slowly changing dimension (SCD) type use cases at production scale.
- Working, hands-on experience with SQL-based databases and data warehousing/ETL technologies such as Informatica, DataStage, Oracle, Teradata, SQL Server, and MySQL.
- Linux/Unix administration skills and hands-on experience with AWS, Azure, or GCP are required.
- Proven, practical experience with JVM and memory management techniques such as garbage collection and heap/thread dump analysis is required.
- Excellent analytical and troubleshooting skills are required. Candidate should be able to demonstrate technical excellence by applying engineering principles to solve complex problems.
- Work with a high degree of integrity, accountability, attention to detail, execution, and planning expertise.
- Excellent contextual interpretation and writing skills, with an effective ability to summarize and communicate to technical and business audiences, are required.
- Demonstrated strong ability to make timely decisions from both business and technical perspectives.
- Enjoy working under pressure in a fast-paced, high-performance environment.
- Candidate must demonstrate resilience and the capacity to maintain a constructive attitude during high-pressure situations.
- Ability to work holidays and weekends as part of an on-call rotation is required.
- Bachelor's degree in Computer Science or a related field is required.
Required Experience:
Manager