Senior Staff System Engineer, GPU Fleet

Coupang

Job Location:

Bengaluru - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Company Introduction

We exist to wow our customers. We know we're doing the right thing when we hear our customers say, "How did I ever live without Coupang?" Born out of an obsession to make shopping, eating, and living easier than ever, we're collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies and have established an unparalleled reputation for being a dominant and reliable force in South Korean commerce.

We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have maintained since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.

Our mission to build the future of commerce is real. We push the boundaries of what's possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.

Role Overview

We are seeking a Sr Staff System Engineer, GPU Fleet for our Coupang Intelligent Cloud (CIC) team to serve as the senior technical owner for our hyperscale GPU compute. In this role, you will define fleet architecture, drive reliability and automation at scale, and lead the operation and evolution of GPU systems supporting large-scale AI training and inference workloads. This is a hands-on, staff-level individual contributor role with broad technical ownership, high operational impact, and significant cross-functional influence across hardware, infrastructure, and datacenter operations.

CIC builds the infrastructure for abundant intelligence. We partner with leading AI labs, governments, and enterprises to deliver hyperscale GPU compute with high reliability, performance, and efficiency. Our infrastructure supports some of the most demanding AI training and inference workloads in production today.

We operate with urgency, deep ownership, and a strong bias toward execution. Reliability, operational excellence, and rigorous systems engineering are core to our business.

What You Will Do

As a Sr Staff System Engineer, GPU Fleet, you will be the senior technical owner for CIC's large-scale GPU compute infrastructure. This is a hands-on senior individual contributor role with fleet-level responsibility and broad cross-functional influence.

You will define the technical direction for how GPU fleets are architected, operated, automated, and evolved across multiple generations of hardware. Your work will directly affect fleet reliability, operating efficiency, scalability, and customer success.

This role does not involve people management, but it carries principal-level scope, autonomy, and decision-making authority across infrastructure, hardware, and operations.

Key Responsibilities:

Fleet Architecture & Technical Ownership

Own the end-to-end technical architecture of hyperscale GPU fleets, including hardware platform selection, firmware strategy, OS configuration, drivers, networking, and observability.
Define and enforce technical standards and best practices for fleet reliability, availability, performance, and operability.
Lead major fleet-wide initiatives such as new GPU platform bring-ups, multi-generation hardware transitions, and architectural redesigns.
Evaluate tradeoffs across cost, performance, reliability, and time-to-deploy, and make technically sound decisions under ambiguity.

Reliability, Availability & Performance

Set and drive fleet-level reliability, availability, and performance objectives.
Lead root-cause analysis and resolution of complex, systemic failures affecting large portions of the fleet or multiple datacenters.
Identify recurring failure patterns and drive long-term fixes spanning hardware, software, automation, and operational processes.
Work directly with hardware vendors and partners to resolve platform-level issues and influence future hardware designs.

Automation & Systems Engineering

Design and build large-scale automation systems for:

GPU fleet provisioning and lifecycle management
GPU health validation, diagnostics, and certification
Automated remediation, recovery, and replacement workflows

Eliminate manual operational toil through durable, well-designed tooling that scales with fleet growth.
Ensure all fleet systems are observable, testable, and resilient under failure conditions.

Operational Leadership

Act as a senior escalation point for critical production incidents impacting GPU availability or customer workloads.
Participate in on-call rotations with a strong emphasis on preventing future incidents, not just responding to them.
Lead high-severity post-incident reviews and ensure learnings are translated into concrete engineering and process improvements.

Technical Influence & Mentorship

Provide technical mentorship and guidance to system and infrastructure engineers across the organization.
Serve as a trusted technical partner to platform engineering, networking, datacenter operations, and leadership teams.
Influence CIC's long-term infrastructure roadmap through strong technical judgment and data-driven recommendations.

Basic Qualifications

12 years of overall experience, with at least 8 years in Linux systems engineering, infrastructure engineering, or datacenter operations, operating production environments with strict uptime and performance requirements.
Deep hands-on expertise in Linux system internals, including process scheduling, memory management, filesystem behavior, networking, kernel behavior, and system performance analysis.
Demonstrated experience operating hardware-intensive infrastructure in production, including bare-metal servers at scale.
Proven ability to debug complex issues across multiple system layers, including hardware components, firmware/BIOS, kernel drivers, OS configuration, and userspace services.
Extensive experience writing production-grade automation using Python and Bash for provisioning, configuration management, diagnostics, remediation, and fleet operations.
Strong understanding of how to design systems that are observable, resilient, and safe under failure, rather than reliant on manual intervention.

Preferred Qualifications

Direct experience operating large-scale GPU fleets supporting AI/ML training and/or inference workloads in production.
Familiarity with modern GPU platforms and ecosystems, including GPU drivers, CUDA, NCCL, and high-performance compute workloads.
Experience with high-speed interconnects and datacenter networking, such as NVLink, InfiniBand, RDMA, and high-throughput Ethernet.
Prior ownership of fleet-wide or platform-wide initiatives, such as new hardware bring-ups, major architectural changes, or reliability transformations.
Experience partnering directly with hardware vendors or manufacturers to troubleshoot systemic issues or influence future platform designs.
Strong intuition for failure modes at scale, including cascading failures, correlated faults, and second-order effects across systems.
History of acting as a technical authority or escalation point for ambiguous, high-impact production problems.
Ability to mentor engineers through design reviews, technical problem solving, and modelling strong operational ownership.
Experience participating in on-call rotations and responding to high-severity production incidents with clear ownership, urgency, and technical leadership.
Strong written and verbal communication skills, including clear post-incident reviews and technical documentation.

Type of work model

Hybrid

Details to consider

Those eligible for employment protection (recipients of veterans' benefits, the disabled, etc.) may receive preferential treatment for employment in accordance with applicable laws.

Privacy Notice

Your personal information will be collected and managed by Coupang as stated in the Application Privacy Notice located below.

Experience: Staff IC
