Job Description
The Data Technology and Engineering (DTE) Infrastructure team is expanding its Generative AI and large language model (LLM) capabilities and is looking for a senior principal engineer who specializes in the operational aspects of designing, developing, deploying, and optimizing Gen AI solutions at scale. This engineer is responsible for the ongoing deployment, management, and optimization of LLMs on the Azure OpenAI infrastructure, ensuring that the models perform effectively, are reliable, and meet the needs of various applications and services. This will enable the Data Strategy and Solutions (DSS) team and other internal teams to build LLM-powered solutions using the deployed models. We are seeking Generative AI engineers who are passionate about designing, developing, deploying, and optimizing Gen AI products at scale.
Responsibilities:
1. Model Deployment and Management
- Deploying Models:
- Manage the deployment of LLMs into production environments.
- Configure model serving infrastructures including APIs and endpoints.
- Model Versioning and Lifecycle Management:
- Maintain version control of models, ensuring the correct models are in production.
- Plan and execute model updates and decommissioning of outdated models.
2. Performance Optimization
- Model Serving Efficiency:
- Optimize the inference performance of LLMs for low latency and high throughput.
- Implement techniques such as model quantization, pruning, or distillation.
- Resource Optimization:
- Analyze and optimize compute resources required for model serving.
- Adjust configurations to balance performance and cost.
- Cost Management:
- Monitor costs associated with model training and serving.
- Identify opportunities to reduce expenses without compromising quality.
3. System Monitoring and Maintenance
- Monitoring Model Performance:
- Continuously track model accuracy, response time, and user feedback.
- Use monitoring tools to detect anomalies in model behavior.
- Maintenance Tasks:
- Schedule retraining and fine-tuning of models based on new data.
- Update models to address identified issues or improve performance.
4. Troubleshooting and Incident Response
- Resolving Technical Issues:
- Diagnose and resolve issues related to model outputs such as incorrect or biased responses.
- Debug model-related errors in production.
- Incident Management:
- Lead the response to incidents affecting model performance or availability.
- Document incidents and implement preventive measures.
- Disaster Recovery Planning:
- Develop strategies to recover models and data in case of system failures.
5. Collaboration with Cross-Functional Teams
- Working with Data Scientists and ML Engineers:
- Collaborate to understand model requirements and provide operational support.
- Assist in experiments and evaluate new models or features.
- Establish AI model governance standards and collaborate with cross-functional teams.
- Supporting Developers:
- Help application developers integrate LLM APIs into products and services.
- Oversee cloud and AI services ensuring robust CI/CD pipelines for continuous delivery.
- Provide technical guidance on best practices for using LLMs.
- Stakeholder Communication:
- Communicate model updates, new features, and performance metrics to stakeholders.
6. Supporting AI Ethics and Responsible AI Practices
- Ensuring Ethical AI Deployment:
- Implement policies to ensure models are used responsibly.
- Monitor for misuse of AI capabilities.
- Bias Detection and Mitigation:
- Develop and apply techniques to detect and reduce biases in model outputs.
- Collaborate on fairness and inclusivity initiatives.
- User Privacy:
- Ensure compliance with data privacy laws and regulations.
- Manage and protect sensitive information used in model training.
7. Documentation and Knowledge Sharing
- Creating Documentation:
- Document model architectures, training processes, and operational procedures.
- Maintain records of experiments and performance evaluations.
- Training and Mentoring:
- Share knowledge with team members about best practices in LLM operations.
- Provide training sessions on new tools or methodologies.
- Building Runbooks:
- Develop standard operating procedures for common tasks and incident responses.
8. Integrations and Data Pipeline Management
- Managing Data Workflows and Integrations:
- Set up and maintain data ingestion pipelines for training and fine-tuning LLMs.
- Ensure data is processed efficiently and securely.
- Set up integrations with external data sources.
- Data Preprocessing:
- Collaborate with DSS team to preprocess and clean datasets for model training.
- Implement data augmentation techniques to enhance model performance.
- Data Storage Solutions:
- Optimize storage solutions for training data.
- Implement data retention policies and archiving strategies.
9. Continuous Improvement and Learning
- Staying Current with Technology Trends:
- Keep updated on advancements in LLMs and natural language processing.
- Attend workshops, conferences, and training sessions.
- Experimentation:
- Test new algorithms or techniques to improve model capabilities.
- Feedback Integration:
- Collect and analyze user feedback to enhance model performance.
10. Compliance with Best Practices and Standards
- Adhering to AI Development Standards:
- Follow industry best practices for AI model development and deployment.
- Quality Assurance:
- Implement testing frameworks to validate model outputs pre-deployment.
- Standardization:
- Establish standards for model development, naming conventions, and versioning.
11. Support GenAI/LLM off-the-shelf products such as Microsoft 365 Copilot
- Develop best practices for M365 Copilot:
- Follow industry best practices and document them.
- Day-to-day end user support:
- Assist end users with driving adoption, training, etc.
Knowledge and Skills:
- Strong understanding of GenAI/LLM Ops in the cloud, preferably on the Microsoft Azure OpenAI infrastructure.
- Strong problem-solving and troubleshooting skills
- Ability to work on multiple concurrent projects and activities as both a lead and team member
- Able to reliably estimate level of effort needed for assignments and work within those parameters
- Able to work independently with minimal guidance
- Strong verbal and written communication skills, organizational skills, and attention to detail
- Demonstrated ability to collaborate in cross-functional teams
Education and Experience:
- Bachelor's degree in a computer science discipline preferred, and 10 years of professional experience in software or data engineering, including at least 3-5 years in machine learning and Generative AI/LLM technology.
- Experience at the cutting edge of AI, automation, and data, and in applying these capabilities to drive measurable impact, including proven experience designing and scaling enterprise-grade AI/ML platforms with an emphasis on GenAI systems and workflow orchestration
- Strong understanding of GenAI design patterns and system components, including Retrieval-Augmented Generation (RAG), vector databases, prompt orchestration, and agentic frameworks
- Experience with LLMOps tools to implement guardrails and to track accuracy, hallucinations, bias, and other metrics in Gen AI products
- Experience leading technical solution design and translating business requirements into technical specifications
- Experience in the build, deployment, integration, and scaling of AI and data-focused applications
- Experience leading agile, cross-functional technical teams to execute on technology infrastructure evolutions and custom applications, as well as their ongoing maintenance
- Proven ability to operate with a transparent mindset communicating openly with stakeholders at various levels of the organization
- Candidates with cloud ML/GenAI/LLM certifications preferred.
Flex Designation:
Hybrid-Eligible or On-Site Eligible
Flex Eligibility Status:
In this Hybrid-Eligible role, you can choose to be designated as:
1. Hybrid: work remotely up to two days per week; or select
2. On-Site: work five days per week on-site with ad hoc flexibility.
Note: The Flex status for this position is subject to Vertex's Policy on Flex @ Vertex Program and may be changed at any time.
Pay Range:
$0 $0
Disclosure Statement:
The range provided is based on what we believe is a reasonable estimate for the base salary pay range for this job at the time of posting. This role is eligible for an annual bonus and annual equity awards. Some roles may also be eligible for overtime pay, in accordance with federal and state requirements. Actual base salary pay will be based on a number of factors, including skills, competencies, experience, and other job-related factors permitted by law.
At Vertex, our Total Rewards offerings also include inclusive, market-leading benefits to meet our employees wherever they are in their career, financial, family, and well-being journey, while providing flexibility and resources to support their growth and aspirations. These range from medical, dental, and vision benefits to generous paid time off (including a week-long company shutdown in the Summer and the Winter), educational assistance programs including student loan repayment, a generous commuting subsidy, matching charitable donations, 401(k), and so much more.
Company Information
Vertex is a global biotechnology company that invests in scientific innovation.
Vertex is committed to equal employment opportunity and non-discrimination for all employees and qualified applicants without regard to a person's race, color, sex, gender identity or expression, age, religion, national origin, ancestry, ethnicity, disability, veteran status, genetic information, sexual orientation, marital status, or any characteristic protected under applicable law. Vertex is an E-Verify Employer in the United States. Vertex will make reasonable accommodations for qualified individuals with known disabilities, in accordance with applicable law.
Any applicant requiring an accommodation in connection with the hiring process and/or to perform the essential functions of the position for which the applicant has applied should make a request to the recruiter or hiring manager or contact Talent Acquisition at