Job Description:
Build the systems that power large-scale machine learning training.
We're partnering with a highly respected AI research and engineering organization to hire a Full Stack Engineer focused on building end-to-end systems that support machine learning model training, experimentation, and evaluation, including work on large language models.
This role sits at the intersection of software engineering and applied ML. You'll design and ship internal tools used daily by researchers and engineers to move faster, debug training issues, and ship higher-quality models.
What You'll Work On
- Build and scale internal ML workflow systems: dataset creation, training orchestration, experiment tracking, evaluation, and model/version management
- Contribute hands-on to model training and fine-tuning efforts (LLMs preferred; other deep learning experience welcome)
- Develop backend services and APIs in Python supporting training and evaluation pipelines
- Build TypeScript-based UIs that allow users to launch runs, compare experiments, inspect metrics, and debug failures
- Design efficient SQL schemas and queries with attention to performance and indexing tradeoffs
- Improve reliability and reproducibility of ML systems through testing, CI/CD, monitoring, and safe rollouts
- Partner closely with ML researchers, infrastructure, and product teams to turn ambiguous needs into shipped systems
What We're Looking For
- 8 years of professional software engineering experience owning complex systems in production
- Strong full-stack experience across Python backends, SQL data layers, and TypeScript frontends
- Hands-on experience with machine learning training workflows (beyond calling model APIs)
- Experience training or fine-tuning ML models (LLMs a plus)
- Solid understanding of SQL performance fundamentals (query design, indexing, debugging slow queries)
- Comfortable operating in ambiguous, fast-moving environments with high ownership
Nice to Have
- Experience building internal ML developer tooling (experiment tracking, eval frameworks, model registries)
- Familiarity with training infrastructure concepts (orchestration, checkpointing, failure recovery)
- Exposure to LLM evaluation methods and quality metrics
- Experience in startup or research-adjacent engineering roles
Why This Role
- Work on cutting-edge machine learning systems with real research impact
- High ownership and technical depth - not a narrow product role
- Collaborative team environment focused on quality, speed, and experimentation