Data Engineer, Scientific Data Ingestion

Mithrl

Job Location:

San Francisco, CA - USA

Monthly Salary: $ 150 - 200
Posted on: Yesterday
Vacancies: 1 Vacancy

Department:

Engineering

Job Summary

ABOUT MITHRL

We envision a world where novel drugs and therapies reach patients in months, not years, accelerating breakthroughs that save lives.

Mithrl is building the world's first commercially available AI Co-Scientist, a discovery engine that empowers life science teams to go from messy biological data to novel insights in minutes. Scientists ask questions in natural language, and Mithrl answers with real analysis, novel targets, and patent-ready reports.

Our traction speaks for itself:

  • 12X year-over-year revenue growth

  • Trusted by leading biotechs and big pharma across three continents

  • Driving real breakthroughs from target discovery to patient outcomes

WHAT YOU WILL DO

Build and own an AI-powered ingestion & normalization pipeline to import data from a wide variety of sources: unprocessed Excel/CSV uploads, lab and instrument exports, and processed data from internal pipelines.

Develop robust schema mapping, coercion, and conversion logic (think: units normalization, metadata standardization, variable-name harmonization, vendor-instrument quirks, plate-reader formats, reference-genome or annotation updates, batch-effect correction, etc.).

Use LLM-driven and classical data-engineering tools to structure semi-structured or messy tabular data: extracting metadata, inferring column roles/types, cleaning free-text headers, fixing inconsistencies, and preparing final clean datasets.

Ensure all transformations that should only happen once (normalization, coercion, batch-correction) execute during ingestion, so downstream analytics / the AI Co-Scientist always works with clean, canonical data.

Build validation, verification, and quality-control layers to catch ambiguous, inconsistent, or corrupt data before it enters the platform.

Collaborate with product teams, data science / bioinformatics colleagues, and infrastructure engineers to define and enforce data standards and ensure pipeline outputs integrate cleanly into downstream analysis and storage systems.

WHAT YOU BRING

Must-have

  • 5 years of experience in data engineering / data wrangling with real-world tabular or semi-structured data.

  • Strong fluency in Python and data processing tools (Pandas, Polars, PyArrow, or similar).

  • Excellent experience dealing with messy Excel / CSV / spreadsheet-style data (inconsistent headers, multiple sheets, mixed formats, free-text fields) and normalizing it into clean structures.

  • Comfort designing and maintaining robust ETL/ELT pipelines, ideally for scientific or lab-derived data.

  • Ability to combine classical data engineering with LLM-powered data normalization / metadata extraction / cleaning.

  • Strong desire and ability to own the ingestion & normalization layer end-to-end, from raw upload to final clean dataset, with an eye for maintainability, reproducibility, and scalability.

  • Good communication skills; able to collaborate across teams (product, bioinformatics, infra) and translate real-world messy data problems into robust engineering solutions.

Nice-to-have

  • Familiarity with scientific data types and modalities (e.g. plate-readers, genomics metadata, time-series, batch-info, instrumentation outputs).

  • Experience with workflow orchestration tools (e.g. Nextflow, Prefect, Airflow, Dagster) or building pipeline abstractions.

  • Experience with cloud infrastructure and data storage (AWS S3, data lakes/warehouses, database schemas) to support multi-tenant ingestion.

  • Past exposure to LLM-based data transformation or cleansing agents: building or integrating tools that clean or structure messy data automatically.

  • Any background in computational biology / lab-data / bioinformatics is a bonus, though not required.

WHAT YOU WILL LOVE AT MITHRL

  • Mission-driven impact: you'll be the gatekeeper of data quality, ensuring that all scientific data entering Mithrl becomes clean, consistent, and analysis-ready. You'll have outsized influence over the reliability and trustworthiness of our entire data and AI stack.

  • High ownership & autonomy: this role is yours to shape. You decide how ingestion works, define the standards, and build the pipelines. You'll work closely with our product, data science, and infrastructure teams, shaping how data is ingested, stored, and exposed to end users or AI agents.

  • Team: Join a tight-knit, talent-dense team of engineers, scientists, and builders

  • Culture: We value consistency, clarity, and hard work. We solve hard problems through focused daily execution

  • Speed: We ship fast (2x/week) and improve continuously based on real user feedback

  • Location: Beautiful SF office with a high-energy in-person culture

  • Benefits: Comprehensive PPO health coverage through Anthem (medical, dental, and vision), 401(k) with top-tier plans

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work. We think AI systems like the ones we're building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.


Required Experience:

IC

Key Skills

  • Apache Hive
  • S3
  • Hadoop
  • Redshift
  • Spark
  • AWS
  • Apache Pig
  • NoSQL
  • Big Data
  • Data Warehouse
  • Kafka
  • Scala

About Company

Generate bioinformatics in minutes.