AI Platform Engineer

Roche

Job Location:

Shanghai - China

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

At Roche you can show up as yourself, embraced for the unique qualities you bring. Our culture encourages personal expression, open dialogue, and genuine connections, where you are valued, accepted, and respected for who you are, allowing you to thrive both personally and professionally. This is how we aim to prevent, stop, and cure diseases and ensure everyone has access to healthcare today and for generations to come. Join Roche, where every voice matters.

The Position

AI Platform Engineer

Role Overview

Own the full lifecycle of a production AI/ML platform spanning on-prem GPU clusters, multi-cloud infrastructure (AWS, Alibaba Cloud), and service delivery. This role bridges datacenter hardware, platform engineering, and AI service operations in a GxP-regulated pharmaceutical environment. You will work closely with global engineering teams, solution architects, and business stakeholders across time zones.

Key Responsibilities

Infrastructure Engineering (On-Prem & Cloud)

Own the OS baseline: Red Hat Satellite management, custom base ISO lifecycle

Integrate with enterprise storage systems managed by the Roche Storage team

GPU server BOM selection and hardware qualification

Architect cloud resource strategy: Reserved Instance planning, cost optimization across AWS and Alibaba Cloud

Post-provisioning configuration and management of cloud accounts (AWS and Alibaba Cloud) for platform and platform-managed use case accounts

Infrastructure as Code (IaC)

Develop and maintain Ansible scripts for automated server management (provisioning, decommissioning, configuration)

Build and operate AMI Bakery pipelines for immutable image delivery

Orchestrate multi-cloud server deployments (AWS, Alibaba Cloud) via IaC

Automate Kubernetes cluster provisioning and management

Develop and harden custom IaC scripts

MLOps Platform Engineering

Manage full cluster lifecycle: provisioning, upgrades, scaling, disaster recovery

Own 30 platform components across the following domains:

GPU & Device Management

AI Workload Orchestration: engineering for Kubernetes and Slurm scheduling

Networking: connectivity engineering inside the Kubernetes cluster

Storage: integration with multiple storage types, including object storage and block storage

Observability: design and implement observability dashboards via Prometheus, Grafana, OpenTelemetry, etc.

Security & PKI: PKI management for the entire platform; implement DevSecOps practices in the DevOps lifecycle

Platform engineering on data, configuration, API, training/inference frameworks, pipelines, and toolsets

Build and maintain CI/CD pipelines and GitHub/GitLab templates

Support AI use cases on engineering tasks.

Accountable for troubleshooting platform-related issues, including leading troubleshooting across services that belong to different teams.

AI Platform Services

Deploy and operate AI Gateway (Portkey Data Plane) with full IaC coverage

Execute on-prem model lifecycle management

Develop and maintain workspace auto-provisioning scripts

Integrate AI safety guardrails

Build and implement FinOps processes

Support AI use cases on engineering tasks.

Accountable for troubleshooting platform-related issues, including leading troubleshooting across services that belong to different teams.

Compliance & Process

Author and maintain system design documents

Manage documentation and approval workflows (via Veeva QualityDocs, and Markdown in GitLab for project documents, runbooks, and user manuals)

Manage workloads in Jira

Requirements

Must Have Technical

8 years in production Linux systems engineering with deep RHEL expertise (Satellite, Kickstart, custom ISO builds)

5 years operating Kubernetes in production at scale (500 nodes or 5000 pods), including cluster lifecycle management, disaster recovery, and multi-tenant isolation

Expert-level IaC proficiency: Ansible (custom module/plugin development), Terraform (provider development, state management at scale)

Hands-on GPU cluster experience: NVIDIA driver lifecycle, MIG/vGPU partitioning, CUDA compatibility matrix management, GPU health monitoring

Strong networking fundamentals: L2/L3 design, VLAN segmentation, BGP basics, IPAM at datacenter scale; experience with high-performance fabrics (RDMA/RoCE/InfiniBand) for distributed training

Deep AWS experience (VPC architecture, EC2 placement groups, EKS, IAM policy design) with production workloads

Helm chart development: authoring complex charts with subcharts, hooks, and CRD lifecycle management

CI/CD pipeline ownership: end-to-end container image build, vulnerability scanning, artifact promotion, and GitOps-based deployment

Business-level English proficiency (written and spoken).

Cross-functional collaboration. You will work at the intersection of infrastructure, security, compliance, and data science teams. The ability to translate between deeply technical infrastructure concerns and business/compliance requirements is essential: not just teamwork, but the ability to drive alignment across groups with competing priorities.

You need to lead troubleshooting in real time, communicate status clearly to stakeholders, and conduct blameless post-mortems afterward.

Experience operating and customizing AI/ML serving platforms (Seldon Core, KServe/KFServing, Triton Inference Server) in production

Service mesh expertise (Istio: traffic management, mTLS, authorization policies) at scale

Full-stack observability design: Prometheus federation, Grafana dashboard-as-code, ELK/OpenSearch log pipelines, OpenTelemetry instrumentation

Production experience with multi-cloud orchestration (AWS, Alibaba Cloud), including cross-cloud networking and unified IaC

Familiarity with GxP/CSV compliance in pharmaceutical or life sciences: change control, validation protocols, audit trail requirements

Experience with AI Gateway / LLM routing systems (Portkey, LiteLLM, or equivalent)

FinOps practice: GPU cost modeling, chargeback/showback implementation, cloud resource cost optimization

Contributions to open-source infrastructure projects (CNCF ecosystem preferred)

Experience mentoring junior engineers or leading small infrastructure teams (25 people)

Track record of building internal developer platforms or self-service tooling for ML/data science teams

Nice to Have

Familiarity with pharmaceutical IT service management (ServiceNow ITSM, Veeva QualityDocs)

Prior experience in a platform team serving internal ML/data science customers (100 users)

Who we are

A healthier future drives us to innovate. Together, more than 100,000 employees across the globe are dedicated to advancing science, ensuring everyone has access to healthcare today and for generations to come. Our efforts result in more than 26 million people treated with our medicines and over 30 billion tests conducted using our Diagnostics products. We empower each other to explore new possibilities, foster creativity, and keep our ambitions high, so we can deliver life-changing healthcare solutions that make a global impact.

Let's build a healthier future together.

Roche is an Equal Opportunity Employer.
