Data Engineer – Gen AI

Edison, NJ - USA

Monthly Salary: Not Disclosed

Posted on: 15-10-2025

Vacancies: 1 Vacancy

Job Summary

Job Title: Data Engineer Gen AI
Location: Edison, NJ
Domain: IT Services
Duration: Long-Term Contract
Looking for W2 candidates. No C2C.

Responsibilities:

Design, build, and maintain scalable data pipelines to support Generative AI and LLM-based applications.
Collect, clean, and preprocess structured and unstructured data for model training, fine-tuning, and retrieval-augmented generation (RAG).
Implement robust data ingestion frameworks integrating APIs, streaming sources, and external repositories.
Collaborate with AI/ML teams to deliver high-quality, domain-specific datasets optimized for transformer-based architectures.
Architect and manage vector databases (e.g., Pinecone, FAISS, Weaviate) for efficient embedding storage and semantic search.
Optimize data storage, retrieval, and transformation workflows across multi-cloud and hybrid environments.
Automate data versioning, lineage tracking, and governance processes to ensure compliance and reproducibility.
Build scalable ETL/ELT frameworks and orchestrate workflows using Airflow, Prefect, or Dagster.
Contribute to prompt engineering and model evaluation pipelines through metadata enrichment and contextual data provisioning.
Ensure data quality, privacy, and ethical-use standards across all Generative AI applications.

Qualifications:

8 years of professional experience in Data Engineering; 2 years supporting AI/ML or Generative AI workflows.
Proficiency in Python, SQL, and distributed data processing frameworks (Spark, PySpark, Dask).
Strong experience with data pipeline orchestration tools (Airflow, Luigi, Dagster, or Prefect).
Hands-on experience with cloud data ecosystems such as AWS (Glue, Redshift, S3), Azure Data Factory, or GCP BigQuery.
Knowledge of vector databases and embedding models for RAG-based systems.
Familiarity with LangChain, LLMOps, and data preparation for fine-tuning LLMs.
Experience in containerization and orchestration (Docker, Kubernetes).
Working knowledge of API integration, data governance, and data cataloging tools (e.g., DataHub, Amundsen).
Exposure to Generative AI concepts such as embeddings, tokenization, and prompt optimization.
Understanding of Responsible AI practices, data anonymization, and bias mitigation techniques.

Best Regards
Tarun K
Phone: 1-
Email:
