LLMOps Engineer
Job Summary
- Lead LLM infrastructure efforts across multiple engineering teams, ensuring scalable, secure, and efficient delivery of AI-powered features.
- Design, build, and maintain production-grade systems for deploying and managing LLMs, including versioning, A/B testing, and rollback strategies.
- Collaborate with the AI team to implement prompt management systems, prompt versioning, and token optimization strategies.
- Monitor and optimize inference latency, throughput, caching strategies, and multi-provider cost management (OpenAI, Anthropic, AWS Bedrock, etc.).
- Develop observability pipelines, including quality metrics, evaluation workflows, error monitoring, and user feedback loops.
- Implement and maintain Retrieval-Augmented Generation (RAG) systems, embedding pipelines, and vector database operations.
- Support fine-tuning workflows and manage model registries for both proprietary and open-source models.
- Implement AI safety guardrails, content filtering, and compliance measures to ensure responsible deployment.
- Support general DevOps initiatives 10% of the time, including CI/CD improvements and cloud infrastructure updates.
- Maintain thorough documentation of all LLM infrastructure processes and best practices.
- 3 years of experience in LLMOps, MLOps, or similar production-focused AI/ML roles.
- Strong Python programming skills and familiarity with LLM libraries and frameworks.
- Hands-on experience with LLM providers (OpenAI, Anthropic, AWS Bedrock, Azure, Vertex, Databricks).
- Experience with vector databases such as Pinecone, Weaviate, Qdrant, or Chroma.
- Knowledge of model serving tools (vLLM, TGI, Ray Serve).
- Proficiency with Docker, Kubernetes, and cloud environments (AWS preferred).
- Familiarity with prompt engineering, token optimization, chain-of-thought approaches, and evaluation metrics.
- Experience with LLM-specific tooling (LangSmith, Weights & Biases, Phoenix, MLflow).
- Ability to troubleshoot LLM issues such as high latency, hallucinations, and context window limits.
- Strong communication skills with both technical and non-technical stakeholders.
- Experience with open-source LLMs (Llama, Mistral, etc.).
- Knowledge of advanced RAG techniques, including hybrid search and re-ranking.
- Exposure to agent frameworks and real-time LLM applications.
- Background in traditional MLOps, data engineering, or multimodal models.
- Experience with Ruby on Rails.
- Understanding of AI safety and alignment principles.
- Fast-paced, high-trust environment with significant ownership.
- Opportunity to shape the foundation of Thrive's AI infrastructure from day one.
- Strong career progression and mentorship opportunities.
- 3 weeks of paid vacation and a 1-week holiday shutdown
- Health insurance & wellness coverage
- Yearly Learning & Development Allowance
- Yearly Workspace Allowance
About Company
Thrive powers work teams of all sizes across the talent journey, supporting HR leaders at over 500 global enterprises, next-gen growth firms, and the public sector to navigate organizational change and workforce transitions. From facilitating layoffs and redeployments to providing acc ...