About the Role:
We are seeking a highly skilled and experienced SRE Databricks Platform Administrator to join our Data Operations team. In this critical role, you will be responsible for the availability, performance, reliability, and scalability of our enterprise Databricks platform. You will blend deep expertise in Databricks administration with SRE principles to automate operations, proactively identify and resolve issues, and ensure a seamless experience for our data engineering, data science, and analytics teams. You will champion best practices for platform governance, security, and cost optimization, playing a pivotal role in our data ecosystem.
Key Responsibilities:
Platform Operations & Reliability:
Design, implement, and maintain the Databricks platform infrastructure across multiple cloud environments (AWS, Azure, or GCP).
Ensure high availability, disaster recovery, and business continuity of Databricks workspaces, clusters, and associated services.
Develop and implement robust monitoring, alerting, and logging solutions for the Databricks platform using tools like Prometheus, Grafana, ELK stack, or cloud-native monitoring services (CloudWatch, Azure Monitor, GCP Operations Suite).
Proactively identify and address performance bottlenecks, resource constraints, and potential issues within the Databricks environment.
Participate in on-call rotations to respond to and resolve critical incidents swiftly, performing root cause analysis (RCA) and implementing preventative measures.
Manage and optimize Databricks clusters, including auto-scaling, instance types, and cluster policies, for both interactive and job compute workloads to ensure cost-effectiveness and performance.
Automation & Tooling:
Develop and maintain Infrastructure as Code (IaC) using tools like Bicep, Terraform, or CloudFormation to automate the provisioning, configuration, and management of Databricks resources.
Automate repetitive operational tasks, deployments, and environment provisioning using scripting languages (Python, Bash) and CI/CD pipelines (Jenkins, Azure DevOps, GitLab CI).
Build and maintain custom tools and scripts to enhance Databricks platform capabilities, improve observability, and streamline workflows.
Security & Governance:
Implement and enforce Databricks security best practices, including identity and access management (IAM) with Unity Catalog, SSO integration (Azure AD, Okta), service principals, and granular access controls (RBAC, row-level/column-level security).
Ensure compliance with organizational security policies, data governance standards, and regulatory requirements (e.g., GDPR, HIPAA, industry-specific compliance).
Conduct security audits and vulnerability assessments of the Databricks environment.
Manage secrets using Databricks secrets or a cloud provider secret manager.
Performance Optimization & Cost Management:
Analyze Databricks usage patterns, DBU consumption, and cloud resource costs to identify opportunities for optimization and efficiency gains.
Implement strategies for cost control, including spot instance utilization, intelligent cluster resizing, and effective use of instance pools.
Work with data teams to optimize Spark jobs, notebooks, and SQL queries for performance and cost.
Collaboration & Mentorship:
Collaborate closely with data engineers, data scientists, architects, and other SREs to understand their requirements and provide expert guidance on Databricks best practices.
Provide technical leadership and mentorship to junior administrators and engineers, fostering a culture of reliability and operational excellence.
Stay up to date with the latest Databricks features, cloud services, and SRE methodologies, evaluating and recommending new technologies.