Machine Learning Engineer, Core Data
San Francisco, CA - USA
Job Summary
About Cantina:
Cantina Labs is a social AI company developing a suite of advanced real-time models that push the boundaries of expression, personality, and realism. We bring characters to life, transforming how people tell stories, connect, and create. We build and power ecosystems. Cantina, our flagship social AI platform, is just the beginning.
If you're excited about the potential AI has to shape human creativity and social interactions, join us in building the future!
About the Role:
We're looking for an ML Engineer focused on Data Quality to own the datasets that power our speech systems. You will be hands-on with audio and text data: auditing, denoising, filtering, labeling, and building the tooling and models that turn messy, large-scale data into reliable training corpora for TTS and adjacent tasks. You'll develop data quality metrics and classifiers, run human-in-the-loop annotation programs, and integrate quality gates into our training and evaluation pipelines. Your work will directly improve model performance, robustness, and cost by driving the model-data-eval flywheel from the data side.
What You'll Do:
Dataset ownership: define specs; audit and curate large-scale audio/text; close corpus gaps and fix sample-level issues.
Quality instrumentation: build automated gates/metrics (e.g., SNR, clipping, VAD, WER, SV/LID, safety) with dashboards; validate against listening tests.
Classifiers and filters: train lightweight models to tag, score, and filter data (VAD, ASR gating, LID, SV/diarization, noise/safety); calibrate to subjective outcomes.
Cleaning and integrity: apply denoise/dereverb/de-clip when beneficial; deduplicate and decontaminate; prevent leakage; maintain lineage and versioned releases.
Data selection: optimize mixtures via sampling, weighting, curriculum, and active learning; mine hard negatives and long-tail cases.
Tooling and pipelines: ship reproducible ETL and validation; integrate quality gates into training/eval; add monitoring and alerts.
Human-in-the-loop and compliance: run MTurk/vendor annotation with strong QC; ensure consent/licensing/policy compliance; collaborate across teams and document datasets.
What You'll Bring:
Strong experience building ML-driven data quality systems for audio/speech, or equivalent data-centric ML experience, with a track record of improving model outcomes via better data.
Proficient in Python and PyTorch: training/fine-tuning SSL-ASR models (Whisper, Wav2Vec BERT), building CNN-based classifiers, and writing robust production code.
Audio/speech fundamentals: torchaudio/librosa/ffmpeg, spectrogram features (e.g., log-mel, MFCC), VAD/SAD, basic DSP, and audio QA.
Scalable data engineering skills: Spark/Beam or similar, SQL, Airflow or equivalent orchestration, and cloud storage/computing (AWS/GCP).
Familiarity with ASR/TTS metrics and tooling: WER, MOS/MOSNet, PESQ/STOI/ViSQOL, speaker verification (EER), diarization, language ID.
Experience with dataset validation, versioning, and experiment tracking; comfort debugging data issues from single samples to fleet-wide trends.
Ability to balance rigor with speed and to translate ambiguous requirements into measurable data improvements.
Preferred Experience:
Shipped datasets and/or data quality tooling that moved the needle for TTS/ASR/VC in production.
Built and deployed classifiers for LID, SV/diarization, VAD, noise/glitch detection, or safety/content moderation for audio.
Ran crowdsourcing/vendor annotation at scale with strong quality control (honeypots, IAA, label aggregation).
Background in de-noising/enhancement and their effects on downstream TTS quality.
Contributions to open-source or publications in speech/audio/ML.
Experience with data governance, consent tracking, and policy enforcement.
Required Experience:
IC
About Company
A new social platform where you can create, share, and interact with AI bots live with friends.