Senior AI Infrastructure Engineer

Job Location

Beijing - China

Monthly Salary

Not Disclosed

Vacancy

1 Vacancy

Job Description

Responsibilities

1. Full-Stack AI Infrastructure Architecture & Development:

  • Build a full-stack AI infrastructure system for quantitative scenarios based on Kubernetes, unifying the management of heterogeneous computing resources (e.g., GPU pooling).
  • Integrate high-performance communication layers (e.g., RDMA) and drive the unified development of AI training/inference platforms and GPU operation/maintenance platforms.
  • Streamline the end-to-end workflow from resource scheduling to model deployment, enhancing system efficiency and stability.

2. Intelligent Computing Power Scheduling System Design:

  • Design a global scheduling mechanism supporting multiple task types and priority strategies, leveraging Volcano scheduler capabilities.
  • Lead the customization and maintenance of Volcano and core Operators, optimizing elastic scaling and resource utilization based on the dynamic demands of quantitative tasks.

3. Hardware-Software Co-Optimization & System Reliability:

  • Develop an intermediate layer bridging underlying hardware (GPU/networking/storage) and AI frameworks (PyTorch/TensorFlow).
  • Build GPU elastic resource pools, fault self-healing mechanisms, and unified observability platforms (e.g., monitoring dashboards).
  • Ensure high-efficiency iteration and high availability of large-scale model training through performance tuning and automated operations.

4. Technical Foresight & Architecture Evolution:

  • Drive long-term AI Infra roadmap planning, anticipating quantitative business needs in computing scale, training efficiency, and cost control.
  • Explore and validate cutting-edge architectures (e.g., heterogeneous computing fusion, compute-storage separation, Serverless AI) to strengthen infrastructure capabilities and technical barriers to entry.

Qualifications

1. Bachelor's/Master's degree in Computer Science or a related field; 5-10 years of experience, with strong self-motivation and the execution ability to identify and resolve technical bottlenecks.

2. Deep expertise in AI infrastructure: Kubernetes, GPU resource management, RDMA/high-performance networking, and large-scale distributed AI system design/deployment.

3. Proficient in *Golang/Python*, with solid system programming and automation skills. Priority given to candidates with experience in *Volcano/Kueue schedulers, K8s Operator development, or open-source contributions*.

4. Familiar with core resource scheduling principles, GPU lifecycle management (allocation, isolation, elasticity, fault tolerance), and designing high-availability, low-latency strategies for quantitative tasks.

5. Knowledge of mainstream AI frameworks (PyTorch/TensorFlow), with experience in training/inference performance optimization and cross-team collaboration for framework-infra co-optimization.

6. Preferred: Experience in *FinTech/quantitative AI infrastructure*, an understanding of business-critical computing demands, and the ability to drive cross-team collaboration and value delivery.

Employment Type

Full Time

Company Industry
