Infra Support Engineer (GPU Clusters, Linux), Remote
Job Location:
Petaling Jaya - Malaysia
Monthly Salary:
Not Disclosed
Posted on:
16 days ago
Vacancies:
1 Vacancy
Job Summary
Job Description:
We are a fast-growing AI infrastructure startup based in Silicon Valley, and we're looking for an Infra Support Engineer to join our global team.
This is a remote, work-from-home role, but you must be comfortable with 24/7 rotational shifts and on-call duties.
You will provide L1/L2 technical support for NVIDIA GPU clusters and handle system delivery, monitoring, incident triage, and escalation to SRE teams.
Responsibilities:
- Support AI infrastructure (GPU/CPU nodes, networking, storage, orchestration) via tickets, email, and Slack
- Assist GPU cluster delivery: provisioning, imaging, network validation, BIOS/firmware updates, GPU driver installation
- Monitor system health (alerts dashboards) and respond 24x7 as scheduled
- Triage incidents, follow runbooks, and perform immediate mitigation
- Escalate to global SRE with clear logs/traces
- Maintain incident logs and stakeholder updates
- Perform routine operational tasks (log checks, health/capacity checks, simple automation)
- Participate in postmortems and improve SOPs
Requirements:
- Bachelor's degree in CS or a related field
- 2 years of experience in IT operations, server administration, SRE, DevOps, or technical support
- Strong Linux administration experience (Ubuntu/Red Hat/CentOS)
- Basic networking: TCP/IP, DNS, HTTP, VLANs
- Familiar with monitoring tools (Prometheus, Grafana, Alertmanager)
- Experience with NVIDIA GPU infrastructure and Kubernetes
- Able to read logs and traces and follow runbooks under pressure
- Bilingual in English and Chinese (spoken & written)
- Data centre experience is a big plus
- Willing to work shifts and on-call rotations