Role: Data Engineer
We are seeking a highly skilled Data Engineer to set up Change Data Capture (CDC) for multiple database types to support data lake hydration. The ideal candidate has hands-on experience with Debezium or other CDC frameworks and strong expertise in ETL transformations using Apache Spark, for both streaming and batch data processing.
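To give candidates a concrete sense of the work, here is a minimal sketch of a CDC ingestion job: PySpark Structured Streaming reading Debezium change events from Kafka and landing them in S3. The broker, topic, S3 paths, and simplified event schema are illustrative placeholders, not our actual setup.

    # Sketch: consume Debezium CDC events from Kafka with Structured Streaming.
    # Broker, topic, S3 paths, and the simplified envelope schema are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("cdc-ingest").getOrCreate()

    # Simplified Debezium envelope: "op" is c/u/d, "after" is the new row image
    # (kept as a JSON string here; parsed during downstream transforms).
    envelope = StructType([
        StructField("op", StringType()),
        StructField("ts_ms", LongType()),
        StructField("after", StringType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
           .option("subscribe", "dbserver1.public.orders")     # placeholder topic
           .load())

    events = raw.select(from_json(col("value").cast("string"), envelope).alias("e"))

    # Land raw change events on S3 as Parquet for downstream batch transforms.
    (events.select("e.*")
           .writeStream
           .format("parquet")
           .option("path", "s3://data-lake/raw/orders/")       # placeholder
           .option("checkpointLocation", "s3://data-lake/chk/orders/")
           .start())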
Key Responsibilities:
- Implement Change Data Capture (CDC) for diverse databases to enable real-time and batch data ingestion.
- Develop ETL pipelines using Apache Spark (PySpark/Java) to transform raw CDC data into structured, analytics-ready datasets.
- Work with Apache Spark DataFrames, Spark SQL, and Spark Streaming to build scalable data pipelines.
- Optimize data workflows for performance, reliability, and scalability in a big data environment.
- Utilize Apache Airflow to orchestrate data pipelines and schedule workflows (see the sketch after this list).
- Leverage AWS services for data ingestion, storage, transformation, and processing (e.g., S3, Glue, EMR, Lambda, Step Functions, MWAA).
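As referenced above, a minimal sketch of the orchestration side, assuming Airflow 2.4+ with the Apache Spark provider installed; the DAG id, application path, and connection id are hypothetical:

    # Sketch: a daily Airflow DAG that triggers a Spark transform job.
    # DAG id, application path, and connection id are illustrative placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="orders_cdc_transform",                              # placeholder
        start_date=datetime(2024, 1, 1),
        schedule="@daily",              # "schedule" requires Airflow 2.4+
        catchup=False,
    ) as dag:
        SparkSubmitOperator(
            task_id="transform_orders",
            application="s3://data-lake/jobs/transform_orders.py",  # placeholder
            conn_id="spark_default",
        )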
Required Skills:
- Java: Mid- to senior-level experience.
- Python (PySpark): Mid-level experience.
- Apache Spark: Proficiency in DataFrames, Spark SQL, Spark Streaming, and ETL pipelines.
- Apache Airflow: Experience managing and scheduling workflows.
- AWS Expertise:
  - S3 (CRUD operations; see the basic sketch after this list)
  - EMR & EMR Serverless
  - Glue Data Catalog
  - Step Functions
  - MWAA (Managed Workflows for Apache Airflow)
  - AWS Lambda (Python-based)
  - AWS Batch
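For the S3 item above, the basic CRUD calls via boto3; bucket and key names are placeholders:

    # Sketch: basic S3 CRUD with boto3. Bucket and key names are placeholders.
    import boto3

    s3 = boto3.client("s3")

    # Create (or overwrite) an object.
    s3.put_object(Bucket="data-lake", Key="raw/orders/part-0.json", Body=b'{"id": 1}')

    # Read it back.
    body = s3.get_object(Bucket="data-lake", Key="raw/orders/part-0.json")["Body"].read()

    # List objects under a prefix ("update" in S3 is an overwriting put_object).
    resp = s3.list_objects_v2(Bucket="data-lake", Prefix="raw/orders/")
    keys = [obj["Key"] for obj in resp.get("Contents", [])]

    # Delete.
    s3.delete_object(Bucket="data-lake", Key="raw/orders/part-0.json")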
Nice-to-Have Skills (Bonus):
- Scala for Spark development.
- Apache Hudi for incremental data processing and ACID transactions (see the sketch below).
- Apache Griffin for data quality and validation.
- Performance tuning and optimization in big data environments.
- Deequ (from AWS Labs) for automated data quality verification (not required, but a plus).
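And for the Hudi item above, a minimal sketch of an upsert through the Spark datasource, assuming the hudi-spark bundle is on the classpath; the table name, record key, and S3 path are placeholders:

    # Sketch: upsert changed rows into an Apache Hudi table on S3.
    # Assumes the hudi-spark bundle is on the Spark classpath; names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

    # Stand-in for a DataFrame of changed rows produced by the CDC transform.
    updates_df = spark.createDataFrame(
        [(1, "shipped", 1700000000000)], ["order_id", "status", "ts_ms"])

    hudi_options = {
        "hoodie.table.name": "orders",                           # placeholder
        "hoodie.datasource.write.recordkey.field": "order_id",
        "hoodie.datasource.write.precombine.field": "ts_ms",     # latest event wins
        "hoodie.datasource.write.operation": "upsert",
    }

    (updates_df.write
        .format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://data-lake/curated/orders/"))                 # placeholder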