AWS data services (S3, Glue, Redshift, Athena, Lambda, Step Functions, Kinesis, etc.).
Programming skills in Python, Scala, or PySpark for data processing and automation.
Expertise in SQL and experience with relational and NoSQL databases (e.g., RDS, DynamoDB).
Data Pipeline Development: Design, develop, and optimize ETL/ELT pipelines on AWS and Databricks using tools such as Unity Catalog, PySpark, AWS Glue, Lambda, Step Functions, and Apache Airflow (a PySpark sketch follows this list).
Data Integration: Integrate data from various sources, including relational databases, APIs, and streaming data, ensuring high data quality and consistency.
Cloud Infrastructure Management: Build and manage scalable, secure, and cost-efficient data infrastructure using AWS services like S3, Redshift, Athena, and RDS.
Data Modeling: Create and maintain data models to support analytics and reporting requirements, ensuring efficient querying and storage.
Performance Optimization: Monitor and optimize the performance of data pipelines, databases, and queries to meet SLAs and reduce costs (see the tuning sketch after this list).
Collaboration: Work closely with data scientists, analysts, and software engineers to understand data needs and deliver solutions that enable business insights.
Security and Compliance: Implement best practices for data security, encryption, and compliance with regulations such as GDPR, CCPA, or ITAR.
Automation: Automate repetitive tasks and processes using scripting (Python, Bash) and Infrastructure as Code (e.g., Terraform, AWS CloudFormation); a boto3 sketch follows this list.
Agile Development: Build and optimize continuous integration and continuous deployment (CI/CD) pipelines using GitLab in an Agile environment to enable rapid, reliable software releases.
Monitoring and Troubleshooting: Set up monitoring and alerting for data pipelines and infrastructure, and troubleshoot issues to ensure high availability (see the alarm sketch below).
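For the pipeline-development item above, a minimal PySpark ETL sketch: read raw CSV from S3, clean it, and write partitioned Parquet. The bucket paths, column names, and app name are hypothetical placeholders, not details from this posting.

```python
# Minimal ETL sketch: extract raw CSV, clean it, load partitioned Parquet.
# All paths and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

# Extract: raw CSV landed by an upstream process (hypothetical path).
raw = spark.read.option("header", True).csv("s3://example-raw-bucket/orders/")

# Transform: drop rows missing the key, normalize types, stamp a load date.
clean = (
    raw.dropna(subset=["order_id"])
       .withColumn("order_total", F.col("order_total").cast("double"))
       .withColumn("load_date", F.current_date())
)

# Load: partitioned Parquet that Athena or Redshift Spectrum can query.
clean.write.mode("overwrite").partitionBy("load_date").parquet(
    "s3://example-curated-bucket/orders/"
)
```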
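For the performance-optimization item, a sketch of two common Spark tunings: partition pruning and a broadcast join. The table paths, column names, and cutoff date are assumptions for illustration.

```python
# Performance sketch: prune partitions with a pushed-down filter and avoid
# shuffling a large fact table by broadcasting a small dimension table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("perf-tuning").getOrCreate()

orders = spark.read.parquet("s3://example-curated-bucket/orders/")
regions = spark.read.parquet("s3://example-curated-bucket/regions/")

# Partition pruning: filtering on the partition column lets Spark skip files.
recent = orders.where(F.col("load_date") >= "2024-01-01")

# Broadcast join: ship the small dimension table to every executor instead
# of shuffling the large fact table.
joined = recent.join(F.broadcast(regions), on="region_id", how="left")

joined.explain()  # inspect the physical plan to confirm the broadcast join
```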
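For the automation item, a small boto3 sketch that applies an S3 lifecycle rule to expire stale staging objects. The bucket name, prefix, and retention window are hypothetical.

```python
# Automation sketch: apply a lifecycle rule that expires old staging data.
# Bucket, prefix, and retention period are hypothetical; adjust as needed.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-staging-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-stale-staging",
                "Filter": {"Prefix": "staging/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},  # delete objects after 30 days
            }
        ]
    },
)
```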
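For the monitoring item, a boto3 sketch that creates a CloudWatch alarm on a pipeline Lambda's Errors metric and notifies an SNS topic. The function name, topic ARN, and threshold are assumptions.

```python
# Monitoring sketch: alarm when the pipeline Lambda reports any errors.
# Function name, account ID, and topic ARN are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-etl-lambda-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "orders-etl-handler"}],
    Statistic="Sum",
    Period=300,                  # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-alerts"],
)
```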