Solution Architect AI Infrastructure & Private Cloud
Job Summary
PS- Global Competency Center
Hewlett Packard Enterprise
Job Title - Solution Architect AI Infrastructure & Private Cloud
Job Description:
We are seeking an experienced Solution Architect, AI Infrastructure &
Private Cloud, with deep expertise in AI/ML infrastructure, High
Performance Computing (HPC), and container platforms to join our dynamic
team focused on delivering HPE Private Cloud AI and Enterprise AI
Factory Solutions. This role is instrumental in architecting, deploying,
and optimizing private cloud environments that leverage HPE's
co-developed solutions with NVIDIA, as well as validated HPE reference
architectures, to support enterprise-grade AI workloads at scale.
The ideal candidate will bring strong technical expertise in AI
infrastructure, container orchestration platforms, and hybrid cloud
environments, and will play a key role in delivering scalable, secure,
and high-performance AI platform solutions powered by HPE
GreenLake and NVIDIA AI Enterprise technologies.
Key Responsibilities:
1. Leadership and Strategy:
- Provide delivery assurance and serve as the lead design
authority to ensure seamless execution of enterprise-grade
container platforms (including Red Hat OpenShift and SUSE
Rancher), HPE Private Cloud AI, and HPC/AI solutions, fully
aligned with customer AI/ML strategies and business objectives.
- Align solution architecture with NVIDIA Enterprise AI Factory
design principles, including modular scalability, GPU
optimization, and hybrid cloud orchestration.
- Oversee planning, risk management, and stakeholder alignment
throughout the project lifecycle to ensure successful outcomes.
2. Solution Planning and Design:
- Architect and optimize end-to-end solutions across container
orchestration and HPC workload management domains, leveraging
platforms such as Red Hat OpenShift and SUSE Rancher, and/or
workload schedulers like Slurm and Altair PBS Pro.
- Ensure seamless integration of container and AI platforms with
the broader software ecosystem, including NVIDIA AI Enterprise
as well as open-source DevOps and AI/ML tools and frameworks.
3. Opportunity Assessment:
- Lead technical responses to RFPs, RFIs, and customer inquiries,
ensuring alignment with business and technical requirements.
- Conduct proof-of-concept (PoC) engagements to validate solution
feasibility, performance, and integration within customer
environments.
- Assess customer infrastructure and workloads to recommend
optimal configurations using validated reference architectures
from HPE and strategic partners such as Red Hat, NVIDIA, and SUSE,
along with components from the open-source ecosystem.
4. Innovation and Research:
- Stay current with emerging technologies, industry trends, and
best practices across HPC, Kubernetes, container platforms,
hybrid cloud, and security to inform solution design and
innovation.
5. Customer-Centric Mindset:
- Act as a trusted advisor to enterprise customers, ensuring
alignment of AI solutions with business goals.
- Translate complex technical concepts into value propositions for
stakeholders.
6. Team Collaboration:
- Collaborate with cross-functional teams, including subject
matter experts in infrastructure components (such as HPE servers,
storage, and networking) and data science teams, to ensure cohesive
and integrated solution delivery.
- Mentor technical consultants and contribute to internal
knowledge sharing through tech talks and innovation forums.
Required Skills:
1. HPC & AI Infrastructure
- Extensive knowledge of HPC technologies and workload schedulers such
as Slurm and/or Altair PBS Pro.
- Proficiency and hands-on experience with HPC cluster management
tools, including HPE Performance Cluster Manager (HPCM) and/or NVIDIA
Base Command Manager.
- Good understanding of high-speed networking stacks (InfiniBand,
Mellanox) and performance tuning of HPC components.
- Solid grasp of high-speed networking technologies such
as InfiniBand and Ethernet.
2. Containerization & Orchestration
- Extensive hands-on experience with containerization technologies
such as Docker, Podman, and Singularity.
- Proficiency with at least two container orchestration platforms:
CNCF Kubernetes, Red Hat OpenShift, SUSE Rancher (RKE/K3s), and
Canonical Charmed Kubernetes.
- Strong understanding of GPU technologies, including the NVIDIA GPU
Operator for Kubernetes-based environments and DCGM (Data Center GPU
Manager) for GPU health and performance monitoring.
3. Systems & Virtualization
- Extensive experience in Linux system administration, including
package management, boot process troubleshooting, performance
tuning, and network configuration.
- Proficient with multiple Linux distributions, with hands-on
expertise in at least two of the following: RHEL, SLES, and Ubuntu.
- Experience with virtualization technologies including KVM and
OpenShift Virtualization for deploying and managing virtualized
workloads in hybrid cloud environments.
4. Cloud, DevOps & MLOps
- Solid understanding of hybrid cloud architectures and experience
working with major cloud platforms in conjunction with on-premises
infrastructure.
- Familiarity with DevOps practices, including CI/CD pipelines,
infrastructure as code (IaC), and microservices-based application
delivery.
- Experience integrating and operationalizing open-source AI/ML tools
and frameworks supporting the full model lifecycle from development
to deployment.
- Good understanding of cloud-native security, observability, and
compliance frameworks, ensuring secure and reliable AI/ML operations
at scale.
5. Networking & Protocols
- Strong understanding of core networking principles, including DNS,
TCP/IP, routing, and load balancing, essential for designing
resilient and scalable infrastructure.
- Working knowledge of key network protocols such as S3, NFS, and
SMB/CIFS for data access, transfer, and integration across hybrid
environments.
6. Programming & Automation
- Proficiency in scripting or programming languages such as Python and
Bash.
- Experience automating infrastructure and AI workflows.
7. Soft Skills & Leadership
- Excellent problem-solving, analytical thinking, and communication
skills for engaging both technical and non-technical stakeholders.
- Proven ability to lead complex technical projects from requirements
gathering through architecture design and delivery.
- Strong business acumen with the ability to align technical solutions
with client challenges and objectives.
Qualifications:
- Bachelor's/Master's degree in Computer Science, Information
Technology, or a related field.
- Professional certifications in AI infrastructure, containers, and
Kubernetes are highly desirable, such as RHCSA, RHCE, CNCF
certifications (CKA, CKAD, CKS), and NVIDIA-Certified Associate - AI
Infrastructure and Operations.
- Typically 8-10 years of hands-on experience in architecting and
implementing HPC, AI/ML, and container platform solutions within
hybrid or private cloud environments, with a strong focus on
scalability, performance, and enterprise integration.
Required Skills:
LOAD BALANCING, NETWORKING, DEVOPS, LINUX SYSTEM ADMINISTRATION, GPU, CI/CD, PROGRAMMING, RHCSA, AUTOMATION, KUBERNETES, SLES, DOCKER, PYTHON