Job Description
The Data Technology and Engineering (DTE) Infrastructure team is expanding its Generative AI and large language model (LLM) capabilities and is looking for a senior principal engineer who specializes in the operational aspects of designing, developing, deploying, and optimizing Gen AI solutions at scale. This engineer is responsible for the ongoing deployment, management, and optimization of LLMs on the Azure OpenAI infrastructure, ensuring that the models perform effectively, are reliable, and meet the needs of various applications and services. This will enable the Data Strategy and Solutions (DSS) team and other internal teams to build LLM-powered solutions using the deployed models. We are seeking Generative AI engineers who are passionate about designing, developing, deploying, and optimizing Gen AI products at scale.
Responsibilities:
1. Model Deployment and Management
- Deploying Models:
- Manage the deployment of LLMs into production environments.
- Configure model serving infrastructures including APIs and endpoints.
- Model Versioning and Lifecycle Management:
- Maintain version control of models, ensuring the correct models are in production.
- Plan and execute model updates and decommissioning of outdated models.
2. Performance Optimization
- Model Serving Efficiency:
- Optimize the inference performance of LLMs for low latency and high throughput.
- Implement techniques such as model quantization, pruning, or distillation.
- Resource Optimization:
- Analyze and optimize compute resources required for model serving.
- Adjust configurations to balance performance and cost.
- Cost Management:
- Monitor costs associated with model training and serving.
- Identify opportunities to reduce expenses without compromising quality.
3. System Monitoring and Maintenance
- Monitoring Model Performance:
- Continuously track model accuracy, response time, and user feedback.
- Use monitoring tools to detect anomalies in model behavior.
- Maintenance Tasks:
- Schedule retraining and fine-tuning of models based on new data.
- Update models to address identified issues or improve performance.
4. Troubleshooting and Incident Response
- Resolving Technical Issues:
- Diagnose and resolve issues related to model outputs such as incorrect or biased responses.
- Debug model-related errors in production.
- Incident Management:
- Lead the response to incidents affecting model performance or availability.
- Document incidents and implement preventive measures.
- Disaster Recovery Planning:
- Develop strategies to recover models and data in case of system failures.
5. Collaboration with Cross-Functional Teams
- Working with Data Scientists and ML Engineers:
- Collaborate to understand model requirements and provide operational support.
- Assist in experiments and evaluate new models or features.
- Establish AI model governance standards and collaborate with cross-functional teams.
- Supporting Developers:
- Help application developers integrate LLM APIs into products and services.
- Oversee cloud and AI services ensuring robust CI/CD pipelines for continuous delivery.
- Provide technical guidance on best practices for using LLMs.
- Stakeholder Communication:
- Communicate model updates, new features, and performance metrics to stakeholders.
6. Supporting AI Ethics and Responsible AI Practices
- Ensuring Ethical AI Deployment:
- Implement policies to ensure models are used responsibly.
- Monitor for misuse of AI capabilities.
- Bias Detection and Mitigation:
- Develop and apply techniques to detect and reduce biases in model outputs.
- Collaborate on fairness and inclusivity initiatives.
- User Privacy:
- Ensure compliance with data privacy laws and regulations.
- Manage and protect sensitive information used in model training.
7. Documentation and Knowledge Sharing
- Creating Documentation:
- Document model architectures, training processes, and operational procedures.
- Maintain records of experiments and performance evaluations.
- Training and Mentoring:
- Share knowledge with team members about best practices in LLM operations.
- Provide training sessions on new tools or methodologies.
- Building Runbooks:
- Develop standard operating procedures for common tasks and incident responses.
8. Integrations and Data Pipeline Management
- Managing Data Workflows and Integrations:
- Set up and maintain data ingestion pipelines for training and fine-tuning LLMs.
- Ensure data is processed efficiently and securely.
- Set up integrations with external data sources.
- Data Preprocessing:
- Collaborate with DSS team to preprocess and clean datasets for model training.
- Implement data augmentation techniques to enhance model performance.
- Data Storage Solutions:
- Optimize storage solutions for training data.
- Implement data retention policies and archiving strategies.
9. Continuous Improvement and Learning
- Staying Current with Technology Trends:
- Keep updated on advancements in LLMs and natural language processing.
- Attend workshops, conferences, and training sessions.
- Experimentation:
- Test new algorithms or techniques to improve model capabilities.
- Feedback Integration:
- Collect and analyze user feedback to enhance model performance.
10. Compliance with Best Practices and Standards
- Adhering to AI Development Standards:
- Follow industry best practices for AI model development and deployment.
- Quality Assurance:
- Implement testing frameworks to validate model outputs pre-deployment.
- Standardization:
- Establish standards for model development, naming conventions, and versioning.
11. Support GenAI/LLM off-the-shelf products such as Microsoft 365 Copilot
- Develop best practices for M365 Copilot:
- Follow industry best practices and document them.
- Day-to-day end user support:
- Assist end users with driving adoption, training, etc.
Knowledge and Skills:
- Strong understanding of GenAI/LLM Ops in the cloud, preferably on the Microsoft Azure OpenAI infrastructure.
- Strong problem-solving and troubleshooting skills
- Ability to work on multiple concurrent projects and activities as both a lead and team member
- Able to reliably estimate level of effort needed for assignments and work within those parameters
- Able to work independently with minimal guidance
- Strong verbal and written communication skills, organizational skills, and attention to detail
- Demonstrated ability to collaborate in cross-functional teams
Education and Experience:
- Bachelor's degree in a computer science discipline preferred, and 10 years of professional experience in software or data engineering, including at least 3-5 years in machine learning and Generative AI/LLM technology.
- Experience at the cutting edge of AI, automation, and data, and in applying these capabilities to drive measurable impact, including proven experience designing and scaling enterprise-grade AI/ML platforms with an emphasis on GenAI systems and workflow orchestration
- Strong understanding of GenAI design patterns and system components, including Retrieval-Augmented Generation (RAG), vector databases, prompt orchestration, and agentic frameworks
- Experience with LLMOps tools to implement guardrails and to track accuracy, hallucinations, bias, and other metrics in Gen AI products
- Experience leading technical solution design and translating business requirements into technical specifications
- Experience in the build, deployment, integration, and scaling of AI and data-focused applications
- Experience leading agile, cross-functional technical teams to execute on technology infrastructure evolutions and custom applications, as well as their ongoing maintenance
- Proven ability to operate with a transparent mindset communicating openly with stakeholders at various levels of the organization
- Candidates with cloud ML/GenAI/LLM certifications preferred.
Flex Designation:
Hybrid-Eligible or On-Site Eligible
Flex Eligibility Status:
In this Hybrid-Eligible role, you can choose to be designated as:
1. Hybrid: work remotely up to two days per week; or select
2. On-Site: work five days per week on-site with ad hoc flexibility.
Note: The Flex status for this position is subject to Vertex's Policy on Flex @ Vertex Program and may be changed at any time.
Pay Range:
$0 $0
Disclosure Statement:
The range provided is based on what we believe is a reasonable estimate for the base salary pay range for this job at the time of posting. This role is eligible for an annual bonus and annual equity awards. Some roles may also be eligible for overtime pay, in accordance with federal and state requirements. Actual base salary pay will be based on a number of factors, including skills, competencies, experience, and other job-related factors permitted by law.
At Vertex, our Total Rewards offerings also include inclusive, market-leading benefits to meet our employees wherever they are in their career, financial, family, and well-being journey, while providing flexibility and resources to support their growth and aspirations. These range from medical, dental, and vision benefits to generous paid time off (including a week-long company shutdown in the Summer and the Winter), educational assistance programs including student loan repayment, a generous commuting subsidy, matching charitable donations, 401(k), and so much more.
Company Information
Vertex is a global biotechnology company that invests in scientific innovation.
Vertex is committed to equal employment opportunity and non-discrimination for all employees and qualified applicants without regard to a person's race, color, sex, gender identity or expression, age, religion, national origin, ancestry, ethnicity, disability, veteran status, genetic information, sexual orientation, marital status, or any characteristic protected under applicable law. Vertex is an E-Verify Employer in the United States. Vertex will make reasonable accommodations for qualified individuals with known disabilities, in accordance with applicable law.
Any applicant requiring an accommodation in connection with the hiring process and/or to perform the essential functions of the position for which the applicant has applied should make a request to the recruiter or hiring manager or contact Talent Acquisition at