Principal Architect AI

Arganteal


Job Location: Los Angeles, CA - USA
Monthly Salary: Not Disclosed
Posted on: Yesterday
Vacancies: 1

Job Summary

Overview

The Principal Architect leads HPC/AI-focused Professional Services delivery engagements and cross-functional technical teams on customer programs or projects. They are responsible for technical communications with Engineers, Architects, and the customer for AI-driven projects. The Principal Architect may participate in several customer projects concurrently, integrating AI solutions with enterprise IT systems.

Role Summary

The Principal Architect will be at the epicenter of the AI revolution, working with the most advanced hardware on the planet. Whether you're helping a research facility unlock new scientific breakthroughs or an enterprise build its first private AI cloud, your fingerprints will be on the infrastructure that defines the next decade of technology.

The right person for the job is a senior individual contributor responsible for designing, implementing, and optimizing large-scale High-Performance Computing (HPC) and AI platforms centered on the NVIDIA data center ecosystem. This role operates in a hybrid capacity, combining hands-on technical architecture with selective customer-facing advisory responsibilities.

The architect serves as a technical authority across GPU-accelerated compute, high-performance networking, and modern parallel storage platforms, influencing architectural standards and delivery outcomes while ensuring successful, on-time, and on-budget customer deployments without escalations.

This is a remote, work-from-home position with an average travel expectation of approximately 10% and a willingness to travel more during peak project phases or critical customer engagements.


Key Responsibilities
Architecture and Design
  • Lead the end-to-end architecture of GPU-accelerated HPC and AI platforms including greenfield AI factory designs and optimization of existing HPC environments.
  • Architect integrated solutions spanning compute, networking, and storage using NVIDIA HGX and DGX platforms, Grace CPU architectures, Spectrum-X networking, and high-performance parallel storage systems.
  • Design storage architectures optimized for AI training, inference, and HPC workloads, balancing performance, scalability, resiliency, and cost.
  • Define reference architectures design patterns and best practices for repeatable and supportable customer deployments.

Platform Implementation and Optimization
  • Provide hands-on technical leadership during implementation phases, including cluster bring-up, performance tuning, and workload optimization.
  • Architect and integrate workload orchestration and scheduling platforms using NVIDIA Base Command Manager, Slurm, Kubernetes, and Run:AI.
  • Optimize end-to-end data pipelines, including GPU utilization, storage throughput, metadata performance, and job scheduling efficiency.
  • Troubleshoot performance bottlenecks across compute, networking, and storage.

Storage Architecture & Data Performance
  • Design and validate high-performance storage solutions using modern parallel and scale-out storage platforms.
  • Demonstrate hands-on experience with at least one of the following storage technologies:
      • VAST Data
      • WEKA
      • DDN
      • Lustre
      • NetApp
  • Architect storage solutions that support demanding AI and HPC workloads including high-throughput training pipelines checkpointing and large-scale shared datasets.
  • Collaborate with compute and networking design teams to ensure balanced, bottleneck-free architectures.

Technical Authority and Advisory
  • Act as a senior technical authority for HPC and AI architecture across internal teams and customer engagements.
  • Participate selectively in customer-facing discussions to validate architecture and delivery plans, with a primary focus on design integrity and execution rather than pre-sales.
  • Influence platform standards, architectural direction, and technical decision-making through expertise and demonstrated execution.

Delivery Excellence
  • Identify technical risks early across compute, networking, storage, and orchestration layers, and drive mitigation strategies.
  • Partner with the PMO counterpart to resolve risks and issues upon identification and to ensure production-ready, supportable platforms.
  • Ensure staff, contractors, and partners adhere to best practices and templates for AI solution delivery.
  • Review deployment documents, technical assessments, and other outputs to ensure consistency and accuracy, aligning with AI and One Voice standards.

Required Technical Expertise

Core Mastery Areas
  • Expert-level, deep architectural knowledge of NVIDIA data center platforms, including HGX and DGX.
  • GPU-accelerated compute architecture for AI and HPC workloads.
  • High-performance networking architectures, especially Spectrum-X.
  • Large-scale AI factory and HPC platform design.

Storage Expertise
  • Hands-on architectural experience with high-performance parallel or scale-out storage systems.
  • Deep understanding of storage performance characteristics relevant to AI and HPC workloads, including bandwidth, IOPS, latency, and metadata scaling.
  • Proven experience integrating storage platforms such as VAST Data, NetApp, WEKA, DDN, or Lustre into GPU-accelerated environments.

Working Proficiency
  • NVIDIA Base Command Manager (BCM) for cluster lifecycle management and operations.
  • Slurm for HPC workload scheduling and resource management.
  • Run:AI for GPU orchestration and multi-tenant AI workload optimization.
  • Kubernetes administration, including deploying and managing GPU-accelerated AI and HPC workloads.
  • Linux systems administration in large-scale performance-sensitive environments.
  • Containerized AI workflows and their interaction with schedulers and storage systems.

Additional Experience
  • Experience optimizing existing HPC or AI platforms for performance, utilization, and cost efficiency.
  • Prior experience with multi-site, air-gapped, or regulated environments is beneficial but not required.
  • Experience with liquid cooling, power/cooling design, and data center integration strongly preferred.

Leadership & Influence
  • Senior individual contributor role with influence through technical authority rather than people management.
  • Ability to mentor engineers and architects through design reviews architectural guidance and technical leadership.
  • Comfortable operating autonomously in complex high-impact technical environments.

Documentation & Repeatability Expectations
  • Develop and maintain high-quality architectural documentation, including design blueprints, configuration guides, deployment validation reports, and operational runbooks.
  • Ensure all technical artifacts meet One Voice standards for clarity, completeness, and technical accuracy, enabling consistent delivery across teams.
  • Create reusable templates, reference architectures, and standardized design patterns that accelerate future projects and improve delivery quality.
  • Drive a culture of documentation discipline, ensuring that every deployment is reproducible, supportable, and aligned with architectural intent.

Educational/Experience Requirements
  • Bachelor's degree in a technical field, or equivalent hands-on experience architecting large-scale HPC or AI systems.
  • Advanced degree (MS/PhD) in a relevant field is a plus but not required.
  • Experience: 10 years in HPC, Data Center Architecture, and/or Systems Engineering.
  • Bare Metal Focus: a fundamental preference for and understanding of on-premises hardware constraints (power, cooling, cabling).
  • Proven experience as a Senior or Lead Architect, or equivalent experience in AI projects.


Required Experience:

Staff IC


About Company


Arganteal helps teams automate deployments, reduce errors, and scale IT operations with full-stack DevOps automation solutions and cloud platform operations.
