Title: Databricks and AWS Focused Data Engineer
Location: Columbus, OH - Onsite
Overview:
We are seeking an experienced data engineer to deliver high-quality, scalable data solutions on Databricks and AWS for one of our Big Four clients. You will build and optimize pipelines, implement a medallion architecture, integrate streaming and batch sources, and enforce strong governance and access controls to support analytics and ML use cases.
Key Responsibilities:
- Build and Maintain Data Pipelines: Develop scalable data pipelines using PySpark and Spark within the Databricks environment.
- Implement Medallion Architecture: Design workflows across raw, trusted, and refined layers to drive reliable data processing (see the illustrative pipeline sketch after this list).
- Integrate Diverse Data Sources: Connect data from Kafka streams, extract channels, and APIs.
- Data Cataloging and Governance: Model and register datasets in enterprise data catalogs, ensuring robust governance and accessibility.
- Access Control: Manage secure role-based access patterns to support analytics, AI, and ML needs.
- Team Collaboration: Work closely with peers to achieve required code coverage and deliver high-quality, well-tested solutions.
- Optimize and Operationalize: Tune Spark jobs (partitioning, caching, broadcast joins, AQE), manage Delta Lake performance (Z-Ordering, OPTIMIZE, VACUUM), and implement cost and reliability best practices on AWS (see the tuning sketch after this list).
- Data Quality and Testing: Implement data quality checks and validations (e.g., Great Expectations or custom PySpark checks), unit/integration tests, and CI/CD for Databricks Jobs/Workflows (see the validation sketch after this list).
- Infrastructure as Code: Provision and manage Databricks and AWS resources using Terraform (workspaces, clusters, jobs, secret scopes, Unity Catalog objects, S3, IAM).
- Monitoring and Observability: Set up logging, metrics, and alerts (CloudWatch, Datadog, Databricks audit logs) for pipelines and jobs.
- Documentation: Produce clear technical documentation, runbooks, and data lineage for governed datasets.
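Illustrative pipeline sketch (referenced above): a minimal PySpark example of moving data through raw, trusted, and refined Delta tables on Databricks. All schema names, table names, columns, and S3 paths are hypothetical placeholders, not client specifics.

```python
# Minimal illustrative sketch only; all names below are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

# Raw layer: land source data as-is, stamped with an ingestion timestamp.
raw_df = (
    spark.read.format("json")
    .load("s3://example-bucket/landing/orders/")  # hypothetical landing path
    .withColumn("_ingested_at", F.current_timestamp())
)
raw_df.write.format("delta").mode("append").saveAsTable("raw.orders")

# Trusted layer: deduplicate, enforce types, and drop invalid records.
trusted_df = (
    spark.table("raw.orders")
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("order_id").isNotNull())
)
trusted_df.write.format("delta").mode("overwrite").saveAsTable("trusted.orders")

# Refined layer: business-level aggregate for analytics and ML consumers.
refined_df = (
    spark.table("trusted.orders")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("lifetime_value"),
         F.count("*").alias("order_count"))
)
refined_df.write.format("delta").mode("overwrite").saveAsTable("refined.customer_orders")
```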
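Illustrative tuning sketch (referenced above): a minimal example of the kind of Spark tuning and Delta Lake maintenance the role involves, assuming the hypothetical tables from the previous sketch.

```python
# Minimal illustrative sketch only; table names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Let Adaptive Query Execution coalesce shuffle partitions and handle skew.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Broadcast a small dimension table to avoid shuffling the large fact table.
orders = spark.table("trusted.orders")
customers = spark.table("trusted.customers")
enriched = orders.join(F.broadcast(customers), "customer_id")
enriched.write.format("delta").mode("overwrite").saveAsTable("refined.orders_enriched")

# Routine Delta maintenance: compact small files, co-locate rows on a common
# filter column, and remove files outside the retention window.
spark.sql("OPTIMIZE trusted.orders ZORDER BY (customer_id)")
spark.sql("VACUUM trusted.orders RETAIN 168 HOURS")
```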
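Illustrative validation sketch (referenced above): a minimal custom PySpark data quality check that fails a Databricks job when basic rules are violated; the rules and table name are hypothetical.

```python
# Minimal illustrative sketch only; rules and table names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("trusted.orders")

# Evaluate simple validation rules in a single pass over the table.
metrics = df.agg(
    F.count("*").alias("row_count"),
    F.sum(F.col("order_id").isNull().cast("int")).alias("null_order_ids"),
    F.sum((F.col("amount") < 0).cast("int")).alias("negative_amounts"),
).first()

# Failing an assertion fails the Databricks task, surfacing the issue in alerts.
assert metrics["row_count"] > 0, "trusted.orders is empty"
assert metrics["null_order_ids"] == 0, "null order_id values found"
assert metrics["negative_amounts"] == 0, "negative order amounts found"
```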
Required Skills & Qualifications:
- Databricks: 6-9 years of experience with expert-level proficiency
- PySpark/Spark: 6-9 years of advanced hands-on experience
- AWS: 6-9 years of experience with strong competency, including S3 and Terraform for infrastructure as code
- Data Architecture: Solid knowledge of the medallion pattern and data warehousing best practices
- Data Pipelines: Proven ability to build, optimize, and govern enterprise data pipelines
- Delta Lake and Unity Catalog: Expertise in Delta Lake internals, time travel, schema evolution/enforcement, and Unity Catalog RBAC/ABAC
- Streaming: Hands-on experience with Spark Structured Streaming and Kafka, including checkpointing, exactly-once semantics, and late-arriving data handling (a minimal streaming sketch follows this list)
- CI/CD: Experience with Git-based workflows and CI/CD for Databricks (e.g., Databricks Repos, dbx, GitHub Actions, Azure DevOps, or Jenkins)
- Security and Compliance: Experience with IAM, KMS encryption, secrets management, token/credential rotation, and PII governance
- Performance and Cost: Demonstrated ability to tune Spark jobs and optimize Databricks cluster configurations and AWS usage for cost and throughput
- Collaboration: Experience working in Agile/Scrum teams, participating in peer reviews, and meeting code coverage targets
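Illustrative streaming sketch (referenced above): a minimal Spark Structured Streaming read from Kafka into a Delta table with watermarking and checkpointing; the broker, topic, schema, and paths are hypothetical.

```python
# Minimal illustrative sketch only; broker, topic, schema, and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "orders")                      # hypothetical topic
    .option("startingOffsets", "earliest")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("o"))
    .select("o.*")
    .withWatermark("event_ts", "15 minutes")   # bound state and late-arriving data
    .dropDuplicates(["order_id", "event_ts"])  # drop replayed events within the watermark
)

# The checkpoint location plus Delta's transactional commits are what provide
# effectively exactly-once delivery into the sink across restarts.
query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/orders/")  # hypothetical
    .outputMode("append")
    .toTable("raw.orders_stream")
)
```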
Preferred Skills & Qualifications:
- Certifications: Databricks Data Engineer Professional, AWS Solutions Architect/Developer, HashiCorp Terraform Associate
- Data Catalogs: Experience with enterprise catalogs such as Collibra or Alation and lineage tooling such as OpenLineage
- Orchestration: Databricks Workflows and/or Airflow
- Additional AWS: Glue, Lambda, Step Functions, CloudWatch, Secrets Manager
- Testing: pytest, chispa, Great Expectations, dbx test
- Domain Experience: Analytics and ML feature pipelines, MLOps integrations