Role: Data Engineer
Key Skills: PySpark, Cloudera Data Platform, Big Data, Hadoop, Hive, Kafka
Responsibilities
Data Pipeline Development: Design, develop, and maintain highly scalable and optimized ETL pipelines using PySpark on the Cloudera Data Platform, ensuring data integrity and accuracy (see the sketch after this list).
Data Ingestion: Implement and manage data ingestion processes from a variety of sources (e.g. relational databases, APIs, file systems) to the data lake or data warehouse on CDP.
Data Transformation and Processing: Use PySpark to process, cleanse, and transform large datasets into meaningful formats that support analytical needs and business requirements.
Performance Optimization: Conduct performance tuning of PySpark code and Cloudera components, optimizing resource utilization and reducing the runtime of ETL processes.
Data Quality and Validation: Implement data quality checks, monitoring, and validation routines to ensure data accuracy and reliability throughout the pipeline.
Automation and Orchestration: Automate data workflows using Apache Oozie, Airflow, or similar orchestration tools within the Cloudera ecosystem.
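To make the responsibilities above concrete, here is a minimal PySpark ETL sketch covering ingestion, transformation, a data-quality check, and a partitioned load; the paths, column names, and quality rule are hypothetical placeholders, not this employer's actual pipeline.

```python
# Minimal PySpark ETL sketch -- all paths and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("orders_etl")      # hypothetical job name
    .enableHiveSupport()        # CDP clusters typically expose the Hive metastore
    .getOrCreate()
)

# Ingest: read raw CSV files landed in HDFS (path is an assumption).
raw = spark.read.option("header", True).csv("hdfs:///data/raw/orders/")

# Transform: cast types, drop malformed rows, derive a business column.
orders = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
       .dropna(subset=["order_id", "amount"])
       .withColumn("is_large_order", F.col("amount") > 1000)
)

# Data quality check: fail fast if the primary key is not unique.
dup_keys = orders.groupBy("order_id").count().filter("count > 1").count()
if dup_keys > 0:
    raise ValueError(f"{dup_keys} duplicate order_id values found")

# Load: write partitioned Parquet into the curated zone of the data lake.
(orders.write.mode("overwrite")
       .partitionBy("order_date")
       .parquet("hdfs:///data/curated/orders/"))
```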
Technical Skills
3 years of experience as a Data Engineer with a strong focus on PySpark and the Cloudera Data Platform.
PySpark: Advanced proficiency in PySpark, including working with RDDs, DataFrames, and optimization techniques (a short example follows this list).
Cloudera Data Platform: Strong experience with Cloudera Data Platform (CDP) components, including Cloudera Manager, Hive, Impala, HDFS, and HBase.
Data Warehousing: Knowledge of data warehousing concepts and ETL best practices, and experience with SQL-based tools (e.g. Hive, Impala).
Big Data Technologies: Familiarity with Hadoop, Kafka, and other distributed computing tools.
Orchestration and Scheduling: Experience with Apache Oozie, Airflow, or similar orchestration frameworks (see the orchestration sketch after this list).
Scripting and Automation: Strong scripting skills in Linux.
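As a rough illustration of the optimization techniques mentioned above, the sketch below shows three common PySpark tuning moves: broadcasting a small dimension table to avoid a shuffle-heavy join, caching a DataFrame that is reused downstream, and coalescing partitions before writing to avoid small files. The datasets and paths are hypothetical.

```python
# PySpark tuning sketch -- tables, sizes, and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning_demo").getOrCreate()

facts = spark.read.parquet("hdfs:///data/curated/orders/")     # large fact table
dims = spark.read.parquet("hdfs:///data/curated/customers/")   # small dimension table

# Broadcast the small table so the join avoids shuffling the large one.
joined = facts.join(F.broadcast(dims), "customer_id")

# Cache a DataFrame that feeds several downstream aggregations.
joined.cache()

daily = joined.groupBy("order_date").agg(F.sum("amount").alias("revenue"))
by_region = joined.groupBy("region").agg(F.count("*").alias("orders"))

# Coalesce to fewer partitions before writing to reduce small-file output.
daily.coalesce(8).write.mode("overwrite").parquet("hdfs:///data/marts/daily_revenue/")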
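And as a sketch of the orchestration skill set, a minimal Airflow DAG that schedules the PySpark job and a follow-up validation step might look like this; the DAG id, schedule, and spark-submit commands are illustrative assumptions rather than a prescribed setup.

```python
# Minimal Airflow orchestration sketch -- names and commands are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_etl_daily",       # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # Airflow 2.4+; use schedule_interval on older versions
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_orders_etl",
        bash_command="spark-submit --master yarn etl/orders_etl.py",
    )

    validate = BashOperator(
        task_id="validate_counts",
        bash_command="spark-submit --master yarn etl/validate_orders.py",
    )

    run_etl >> validate  # validation runs only after the ETL task succeeds
```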
Full-Time