JOB DESCRIPTION:
Roles and Responsibilities:
- Data Pipeline Development:
- Design, build, and maintain robust, scalable, high-performance data pipelines (batch and real-time).
- Automate data ingestion from a variety of structured and unstructured sources (APIs, flat files, databases, streams, etc.); a minimal ingestion sketch follows this list.
- Ensure data pipelines are optimized for performance and cost-effectiveness.
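For illustration, here is a minimal batch-ingestion sketch in Python. The API endpoint, response schema, and output path are hypothetical placeholders, not part of any specific stack:

```python
import requests          # HTTP client for the API pull
import pandas as pd      # tabular transforms; Parquet output requires pyarrow

API_URL = "https://api.example.com/orders"   # hypothetical endpoint

def ingest_orders(target_path: str = "orders.parquet") -> int:
    """Pull one batch of orders from the API and land it as Parquet."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()               # fail fast on HTTP errors
    records = response.json()                 # assumed: a JSON list of records

    df = pd.DataFrame(records)
    df.to_parquet(target_path, index=False)   # columnar format for cheap downstream reads
    return len(df)

if __name__ == "__main__":
    print(f"ingested {ingest_orders()} rows")
```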
- Data Integration & Transformation:
- Develop ETL/ELT workflows to clean, transform, and normalize raw data into usable formats (see the pandas sketch after this list).
- Integrate data from different sources into a unified data warehouse or data lake.
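As one example of the transform step, a small pandas cleaning routine. The column names, key, and rules below are illustrative assumptions rather than a prescribed schema:

```python
import pandas as pd

def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Normalize a raw orders extract into an analysis-ready shape."""
    df = raw.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]  # consistent names
    df = df.drop_duplicates(subset=["order_id"])                 # assumed natural key
    df["order_ts"] = pd.to_datetime(df["order_ts"], errors="coerce", utc=True)
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce").fillna(0.0)
    return df.dropna(subset=["order_id", "order_ts"])            # drop unusable rows
```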
- Collaboration:
- Work with data scientists, analysts, and software engineers to understand data needs and deliver efficient solutions.
- Collaborate with DevOps and cloud teams to ensure pipelines run reliably in production environments.
- Monitoring and Debugging:
- Monitor and troubleshoot production data pipelines, ensuring data accuracy and completeness.
- Implement alerting, logging, and auditing systems for data pipelines; a lightweight pattern is sketched below.
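One lightweight pattern for a completeness check with logging, using only the standard library. The tolerance threshold is an assumed example value; in production the raised error would page on-call via the orchestrator:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline.audit")

def check_completeness(loaded_rows: int, source_rows: int, tolerance: float = 0.99) -> None:
    """Log the load ratio and raise if fewer rows landed than tolerated."""
    ratio = loaded_rows / max(source_rows, 1)
    log.info("completeness %.1f%% (%d/%d rows)", ratio * 100, loaded_rows, source_rows)
    if ratio < tolerance:
        log.error("completeness below %.0f%% threshold", tolerance * 100)
        raise RuntimeError("incomplete load")   # surfaces as a failed task / triggers an alert

check_completeness(loaded_rows=995, source_rows=1000)   # passes at the default 99% bar
```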
- Architecture and Design:
- Influence data architecture and contribute to designing scalable systems.
- Participate in decision-making regarding storage strategies, partitioning, and indexing (see the partitioning sketch below).
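To make the partitioning point concrete: one common storage design is date-partitioned Parquet, sketched below with pandas/pyarrow. The dataset and layout are assumed for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user_id": [1, 2, 3],
})

# Writes a Hive-style layout: events/event_date=2024-01-01/..., so query
# engines (Spark, Athena, BigQuery external tables) can prune partitions on read.
df.to_parquet("events", partition_cols=["event_date"], index=False)
```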
- Data Governance and Quality:
- Implement and maintain data quality checks and lineage tracking; a minimal check pattern follows this list.
- Ensure compliance with data security and privacy policies (e.g., GDPR, HIPAA).
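A minimal shape for data quality checks. The rules, column names, and null budget are illustrative assumptions; teams often reach for a framework such as Great Expectations instead of hand-rolled checks:

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list[str]:
    """Return a list of failed-check descriptions (empty list = clean)."""
    failures = []
    if df["order_id"].duplicated().any():
        failures.append("order_id is not unique")
    if df["amount"].lt(0).any():
        failures.append("negative amounts present")
    if df["email"].isna().mean() > 0.05:          # assumed 5% null budget
        failures.append("email null rate above budget")
    return failures
```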
Typical Tech Stack:
Programming & Scripting:
- Python, Scala, or Java (Python is most common)
- SQL for querying and transformation (example below)
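For instance, a typical aggregation-style transformation, runnable here against an in-memory SQLite database via the standard library. Table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, 10, 40.0), (3, 11, 15.0);
""")

# Roll raw orders up into per-customer revenue -- the bread-and-butter
# shape of warehouse transformations.
rows = conn.execute("""
    SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer_id
    ORDER BY revenue DESC
""").fetchall()
print(rows)   # [(10, 2, 65.0), (11, 1, 15.0)]
```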
Data Pipeline & ETL Tools:
- Apache Airflow, Luigi, Dagster, Prefect (an Airflow sketch follows this list)
- AWS Glue / Azure Data Factory / Google Dataflow (depending on cloud provider)
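As a flavor of the orchestration layer, a minimal Airflow 2.x DAG. The DAG id and task bodies are placeholders, and the `schedule` keyword assumes Airflow 2.4+ (older versions use `schedule_interval`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull from source")      # placeholder task body

def load():
    print("write to warehouse")    # placeholder task body

with DAG(
    dag_id="orders_daily",          # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task       # extract runs before load
```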
Data Processing Frameworks:
- Apache Spark, Apache Beam, Flink, or Kafka Streams (PySpark sketch below)
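And a small PySpark batch job in the same vein; the input/output paths and the aggregation are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-clickstream-agg").getOrCreate()

events = spark.read.parquet("s3://my-bucket/clickstream/")   # hypothetical path

# Aggregate raw click events into daily page counts.
daily_counts = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "page")
    .count()
)

daily_counts.write.mode("overwrite").parquet("s3://my-bucket/agg/daily_page_counts/")
spark.stop()
```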
Message Brokers & Streaming:
- Apache Kafka, AWS Kinesis, Google Pub/Sub (consumer sketch below)
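For streaming ingestion, a bare-bones consumer using the kafka-python client. The topic, broker address, group id, and message schema are placeholders:

```python
import json

from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer(
    "orders",                                   # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="orders-loader",
    auto_offset_reset="earliest",               # replay from the start on first run
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    order = message.value
    # A real pipeline would buffer and micro-batch these into the warehouse.
    print(f"partition={message.partition} offset={message.offset} order={order}")
```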
Databases & Storage:
- Relational: PostgreSQL, MySQL
- NoSQL: MongoDB, Cassandra
- Data Warehouses: Snowflake, Redshift, BigQuery, Databricks
- Object Storage: Amazon S3, Azure Blob Storage, GCS (boto3 sketch below)
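Object storage is usually touched through the provider SDK. An S3 example with boto3, where the bucket and key are placeholders and credentials are assumed to come from the environment or an instance profile:

```python
import boto3   # pip install boto3

s3 = boto3.client("s3")
bucket = "my-data-lake"   # hypothetical bucket

# Land a local extract into the raw zone of the lake, keyed by date.
s3.upload_file("orders.parquet", bucket, "raw/orders/dt=2024-01-01/orders.parquet")

# List what we just wrote.
for obj in s3.list_objects_v2(Bucket=bucket, Prefix="raw/orders/").get("Contents", []):
    print(obj["Key"], obj["Size"])
```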
DevOps & CI/CD:
- Docker, Kubernetes
- Git, Jenkins, GitLab CI/CD, Terraform
Monitoring & Logging:
- Prometheus, Grafana (instrumentation sketch below)
- ELK Stack, Datadog, CloudWatch
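On the pipeline side, instrumentation can be as simple as exposing counters for Prometheus to scrape. A sketch with the official prometheus_client library, where the metric names and the simulated load step are illustrative:

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server   # pip install prometheus-client

ROWS_LOADED = Counter("pipeline_rows_loaded_total", "Rows loaded into the warehouse")
LAST_SUCCESS = Gauge("pipeline_last_success_timestamp", "Unix time of last successful run")

start_http_server(8000)   # Prometheus scrapes http://host:8000/metrics

while True:
    batch = random.randint(100, 1000)    # stand-in for a real load step
    ROWS_LOADED.inc(batch)
    LAST_SUCCESS.set_to_current_time()
    time.sleep(60)
```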
Soft Skills / Other Expectations:
- Strong understanding of distributed systems and data modeling
- Excellent problem-solving and debugging skills
- Ability to lead initiatives and mentor junior engineers
- Strong communication and documentation abilities