Data Scientist GenAI Models and AI Agentic Systems
Remote
Rate $75/hr
IBM Internal
Job Description:
Position Overview:
We are seeking a Data Scientist with strong expertise in evaluating open-source Generative AI models, creating compelling data visualizations, and designing agentic AI workflows. This is a unique opportunity to work on the frontier of Generative AI, validating LLMs and agents that power critical enterprise Legal use cases. This position will work within the Legal department. While there is some flexibility in location, work hours for this position are expected to largely align with US standard working days.
The ideal candidate is fluent in the nuances of GenAI model performance, understands both open-source models (e.g., LLaMA 3, Mistral, Falcon, Gemma) and enterprise-grade models like IBM Granite, and is comfortable developing multi-agent orchestration pipelines using frameworks such as LangGraph, CrewAI, AutoGen, or LangChain Agents. You will also collaborate with MLOps teams to deploy and evaluate models using platforms like Hugging Face, MLflow, and Weights & Biases.
Key Responsibilities:
Model Validation & Evaluation
- Design rigorous evaluation pipelines for GenAI models, including IBM Granite models and open-source alternatives hosted on the Hugging Face Hub.
- Define metrics and build test suites to assess model behavior across factuality, coherence, bias, safety, hallucination, and robustness.
- Leverage tools such as Ragas, TruLens, Giskard, or LM Eval Harness to compare performance across prompts, tasks, and domains.
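To give a flavor of the evaluation work above, here is a minimal sketch of a metric-plus-test-suite pipeline. It uses a stubbed model function in place of a real LLM call (the prompts, `stub_model`, and `exact_match_accuracy` are illustrative names, not part of any framework named in this posting); in practice the stub would be replaced by a Hugging Face pipeline over a Granite or LLaMA checkpoint.

```python
# Minimal sketch of a GenAI evaluation pipeline: a small test suite of
# prompts with reference answers, scored by exact match. A stubbed model
# stands in for a real LLM so the example is self-contained.

TEST_SUITE = [
    {"prompt": "What is the capital of France?", "reference": "Paris"},
    {"prompt": "2 + 2 = ?", "reference": "4"},
]

def stub_model(prompt: str) -> str:
    """Hypothetical model call; deliberately wrong on one item."""
    canned = {"What is the capital of France?": "Paris", "2 + 2 = ?": "5"}
    return canned.get(prompt, "")

def exact_match_accuracy(model, suite) -> float:
    """Fraction of prompts whose output matches the reference exactly."""
    hits = sum(model(case["prompt"]).strip() == case["reference"] for case in suite)
    return hits / len(suite)

accuracy = exact_match_accuracy(stub_model, TEST_SUITE)
```

Real evaluations would swap exact match for task-appropriate metrics (faithfulness, toxicity, robustness under paraphrase), but the suite-of-cases structure stays the same.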
Data Visualization & Insight Communication
- Create compelling dashboards and visual narratives to track model performance, error distributions, and drift over time.
- Use libraries like Plotly, Dash, Seaborn, or Matplotlib to build visuals that support real-time and retrospective evaluations.
- Summarize and present findings to cross-functional teams, including product owners, compliance stakeholders, and LLM developers.
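As one concrete example of the drift-tracking visuals described above, the sketch below plots per-run accuracy with Matplotlib. The run labels and scores are made-up illustrative data, and the headless Agg backend is used so the script runs without a display.

```python
# Minimal sketch of a drift-tracking visual: accuracy per evaluation run.
import matplotlib
matplotlib.use("Agg")  # headless backend; no display required
import matplotlib.pyplot as plt

# Hypothetical weekly evaluation scores for one model.
runs = ["wk1", "wk2", "wk3", "wk4"]
accuracy = [0.82, 0.80, 0.74, 0.71]

fig, ax = plt.subplots()
ax.plot(runs, accuracy, marker="o")
ax.set_xlabel("Evaluation run")
ax.set_ylabel("Exact-match accuracy")
ax.set_title("Model accuracy drift over time")
fig.savefig("accuracy_drift.png")

drift = accuracy[0] - accuracy[-1]  # total accuracy drop across runs
```

The same series would typically also feed a live dashboard (Plotly Dash) so drift is visible before a scheduled retrospective.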
Agentic Frameworks & AI Workflows
- Develop and validate multi-agent systems using frameworks like LangGraph, CrewAI, LangChain Agents, or AutoGen.
- Optimize agent roles, memory, and tool selection for task-oriented pipelines in legal, enterprise, or research domains.
- Evaluate how different orchestration strategies (e.g., sequential vs. parallel, static vs. dynamic role assignment) affect task quality and efficiency.
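The sequential-vs.-parallel comparison above can be sketched in plain Python, with simple functions standing in for agents. The `researcher`/`reviewer` roles and both runner functions are hypothetical illustrations, not APIs from LangGraph or CrewAI; a real pipeline would wire LLM-backed agents into a framework graph.

```python
# Minimal sketch comparing two orchestration strategies for a
# two-agent pipeline (researcher -> reviewer). Plain functions stand
# in for framework-managed, LLM-backed agents.

def researcher(task: str) -> str:
    """Drafts an answer to the task."""
    return f"draft answer for: {task}"

def reviewer(text: str) -> str:
    """Reviews whatever text it is handed."""
    return f"reviewed({text})"

def run_sequential(task: str) -> str:
    """Sequential orchestration: the reviewer sees the researcher's draft."""
    return reviewer(researcher(task))

def run_parallel(task: str) -> list[str]:
    """Parallel orchestration: both agents act on the raw task independently."""
    return [researcher(task), reviewer(task)]

sequential_out = run_sequential("summarize contract clause 4")
parallel_out = run_parallel("summarize contract clause 4")
```

Evaluating the two strategies then reduces to scoring `sequential_out` against the merged `parallel_out` on the same test suite, which is exactly the kind of comparison this role would design.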
Required Qualifications:
- 3 years of experience in data science, NLP, or ML research roles, or equivalent academic credentials.
- Practical experience evaluating open-source LLMs (e.g., LLaMA, Mistral, Falcon, Gemma) and familiarity with IBM Granite model capabilities.
- Fluency in Python and data science tools: Pandas, NumPy, Scikit-learn, Jupyter, and at least one visualization library.
- Familiarity with GenAI evaluation frameworks and benchmarks (e.g., TruthfulQA, MMLU, BBQ, HellaSwag).
- Experience with versioned model repositories and libraries via Hugging Face Transformers and Datasets.
- Minimum of a Master's degree in Computer Science, Data Science, Mathematics, or a related field.
Preferred Qualifications:
- Demonstrated experience with agentic frameworks (e.g., LangGraph, CrewAI, AutoGen, LangChain Agents, or Langflow).
- Knowledge of Retrieval-Augmented Generation (RAG), vector databases (e.g., FAISS, Weaviate, Chroma), and hybrid retrieval strategies.
- Exposure to prompt tuning, chain-of-thought prompting, function/tool calling, and memory-aware agents.
- Prior work in regulated environments with a focus on explainability, auditing, or trustworthy AI.
- Experience deploying and monitoring models on cloud platforms (e.g., AWS SageMaker, Azure ML, IBM).