PySpark Data Engineer with Cloudera and Cloud Expertise
Job Summary
Synechron is seeking a highly experienced PySpark Data Engineer to design, develop, and maintain scalable, high-quality data pipelines within the Cloudera Data Platform (CDP). This role is critical in ensuring reliable data ingestion, transformation, and availability for advanced business analytics, reporting, and data science initiatives. The successful candidate will bring a strong background in big data processing, data architecture, and cloud integration, contributing to data-driven decision-making and operational excellence across the organization.
Software Requirements
Required:
Advanced proficiency in PySpark, including handling RDDs, DataFrames, Spark SQL, and optimization techniques
Hands-on experience with Cloudera Data Platform (CDP) components such as Cloudera Manager, Hive, Impala, HDFS, and HBase
Working knowledge of the Hadoop ecosystem, Kafka, and distributed data processing tools
Experience with SQL-based data warehousing tools like Hive and Impala
Scripting skills in Linux (Bash, Python) for automation and operational tasks
Familiarity with orchestration and scheduling tools such as Apache Airflow or Oozie
Preferred:
Knowledge of cloud-native data services (AWS Glue, EMR, Azure Data Factory)
Use of version control systems like Git and CI/CD pipelines (Jenkins, GitLab CI)
Experience with data modeling, data governance, and metadata management tools
Overall Responsibilities
Design, develop, and optimize scalable data pipelines using PySpark within the Cloudera Data Platform.
Manage end-to-end data ingestion processes from multiple sources (relational databases, APIs, file systems) into data lakes or warehouses.
Execute data transformation, cleansing, and aggregation processes supporting analytical and reporting requirements.
Conduct performance tuning of Spark jobs and related CDP components to ensure efficient resource utilization.
Implement data validation and quality checks, ensuring data accuracy and consistency through monitoring and alerting.
Automate data workflows using orchestration tools like Airflow or Oozie to reduce manual intervention.
Monitor pipeline performance, troubleshoot failures, and implement corrective actions for operational stability.
Collaborate with data architects, analysts, and data scientists to support large-scale analytics initiatives.
Document data architecture, pipeline configurations, and operational procedures for ongoing maintenance and governance.
Lead data architecture discussions supporting data privacy, security, and compliance standards.
Technical Skills (By Category)
Programming Languages (Essential):
Python (especially PySpark)
SQL for data extraction, validation, and analysis
Big Data & Data Management (Essential):
Spark (PySpark), Hadoop ecosystem, HDFS, Hive, Impala, HBase
Data ingestion and transformation in large distributed environments
Cloud & Platform Technologies (Preferred):
Cloud-native data processing (AWS EMR, Azure HDInsight, GCP Dataproc)
Frameworks & Libraries (Essential):
Spark SQL, Spark Streaming
Data modeling and governance tools (preferred: Apache Atlas or Collibra)
Orchestration & Automation (Preferred):
Airflow, Oozie, Jenkins
Security & Data Governance (Preferred):
Data masking, encryption, and access control in distributed systems
Experience Requirements
Minimum of 5 years as a Data Engineer with deep expertise in PySpark and big data processing
Proven experience designing, implementing, and maintaining scalable data pipelines in enterprise environments
Strong background with Cloudera Data Platform (CDP) components such as Hive, Impala, HDFS, and HBase
Demonstrated ability to optimize Spark jobs and manage high-volume data workflows
Support experience in cloud environments (AWS, Azure, or GCP) for data processing is advantageous
Industry experience supporting financial services, banking, or highly regulated sectors is a plus
Alternative pathways include extensive hands-on big data processing experience in data-centric roles, with demonstrated expertise in performance tuning and operational stability
Day-to-Day Activities
Develop and optimize Spark (PySpark) data pipelines for ingesting, transforming, and publishing data in large distributed systems.
Monitor data workflows and troubleshoot issues proactively to maintain pipeline health.
Collaborate with data scientists, analysts, and platform teams to meet data quality, security, and governance standards.
Automate operational workflows, including job scheduling, alerting, and resource management.
Tune Spark jobs and related components to optimize runtime and resource efficiency.
Conduct data validation, anomaly detection, and data quality assessments.
Document architecture, data flows, and operational procedures for compliance and knowledge sharing.
Support ongoing system upgrades, data privacy initiatives, and cloud migration efforts.
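The data validation activity above can be sketched as a small, framework-free rule check; the rule names, record shape, and thresholds are invented for illustration, and in practice the same checks would run against Spark DataFrames inside the pipeline.

```python
# Simplified data-quality check over plain dict records (illustrative only).
def validate(rows, required, non_negative):
    """Return a list of (row_index, issue) findings for failed rules."""
    findings = []
    for i, row in enumerate(rows):
        # Rule 1: required columns must be present and non-empty.
        for col in required:
            if row.get(col) in (None, ""):
                findings.append((i, f"missing {col}"))
        # Rule 2: numeric measure columns must not be negative.
        for col in non_negative:
            value = row.get(col)
            if isinstance(value, (int, float)) and value < 0:
                findings.append((i, f"negative {col}"))
    return findings

records = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": None, "amount": -5.0},
]
issues = validate(records, required=["order_id"], non_negative=["amount"])
```

In a production pipeline, findings like these would typically feed the monitoring and alerting mentioned above rather than be inspected by hand.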
Qualifications
Bachelor's or Master's degree in Computer Science, Data Engineering, Information Systems, or equivalent
5 years of hands-on experience in data engineering, with an emphasis on PySpark and big data systems
Proven expertise in designing scalable, high-performance data pipelines in enterprise environments
Hands-on experience with Cloudera Data Platform (CDP), Hadoop, Hive, Impala, and HBase
Strong SQL and data modeling skills within distributed data architectures
Experience with cloud data services is a plus
Relevant certifications (e.g., AWS Data Analytics Specialty, GCP Professional Data Engineer) are advantageous
Strong analytical, troubleshooting, and communication skills
Professional Competencies
Critical thinking and analytical mindset for complex data workflows and problem resolution
Ability to manage multiple priorities and deliver results in a fast-paced environment
Effective collaboration skills for cross-team data initiatives and stakeholder engagement
Innovation-driven approach for optimizing and automating data processes
Ownership mindset to ensure operational stability and data quality standards
Adaptability and continuous learning to keep pace with evolving big data and cloud technologies
SYNECHRON'S DIVERSITY & INCLUSION STATEMENT
Diversity & Inclusion are fundamental to our culture, and Synechron is proud to be an equal opportunity workplace and an affirmative action employer. Our Diversity, Equity and Inclusion (DEI) initiative, Same Difference, is committed to fostering an inclusive culture, promoting equality and diversity, and maintaining an environment that is respectful to all. We strongly believe that, as a global company, a diverse workforce helps build stronger, more successful businesses. We encourage applicants from across diverse backgrounds, races, ethnicities, religions, ages, marital statuses, genders, sexual orientations, or disabilities to apply. We empower our global workforce by offering flexible workplace arrangements, mentoring, internal mobility, learning and development programs, and more.
All employment decisions at Synechron are based on business needs, job requirements, and individual qualifications, without regard to the applicant's gender, gender identity, sexual orientation, race, ethnicity, disabled or veteran status, or any other characteristic protected by law.
Required Experience:
IC
About Company
At Synechron, we believe in the power of digital technology to change businesses for the better. Our global consulting firm combines creativity and innovative technology to deliver industry-leading digital solutions. Progressive technologies and strategies ...