Senior Lead SysOps/DevOps Engineer

Integrant


Job location: Cairo - Egypt

Monthly salary: Not disclosed
Posted: 7 hours ago
Vacancies: 1

Job Summary

Description

We are seeking an exceptional Senior Lead who combines deep hands-on SysOps/HPC expertise with the strategic vision of a solution architect. This is a rare dual-track role: you operate at the intersection of elite technical execution and client-facing presales, designing and running mission-critical GPU/HPC and Kubernetes platforms while co-creating opportunities with our commercial teams.

This role carries both SysOps/HPC depth and DevOps breadth. You are expected to spend at least 60% of your time on implementation and technical execution.

What You Will Do

Presales & Business Development

Partner with sales and solution teams to identify and qualify new opportunities

Lead or support technical presales activities: discovery workshops, RFP responses, architecture presentations

Build and deliver proof-of-concepts (POCs) that demonstrate platform capabilities to prospective clients

Prepare high-quality technical materials

Act as a trusted technical advisor during client conversations, proposing solutions aligned to business goals

In-Account Delivery: SysOps & DevOps Execution

Operate directly within client accounts as a senior SysOps/DevOps engineer

Run, troubleshoot, and optimize production-grade Kubernetes clusters and GPU/HPC environments hands-on

Own Linux system administration at a deep level: kernel tuning, storage, networking, performance profiling

Implement and maintain IaC pipelines, GitOps workflows, and CI/CD systems

Serve as the senior escalation point for complex operational incidents within accounts

Architecture & Solution Design

Design end-to-end platform architectures spanning cloud, hybrid, and on-premises HPC environments

Define workload isolation models, networking architectures, and storage strategies for multi-tenant platforms

Recommend and validate technology choices aligned to client scale, budget, and team maturity

Produce architecture decision records (ADRs), solution blueprints, and technical runbooks

Technical Competencies & Requirements

1. Architecture & System Design

Design production-grade multi-cluster Kubernetes platforms:

RKE2, EKS (AWS), AKS (Azure) at enterprise scale

GPU-aware clusters: NVIDIA H100 / A100 / B200 node pools

Hybrid cloud and on-premises HPC infrastructure

Define and document:

Workload isolation: namespaces, MIG partitioning, multi-tenancy models

Networking: BGP peering, Ingress controllers, service mesh (Istio / Cilium)

Storage: Longhorn, Ceph, distributed and high-throughput file systems

2. Platform Engineering & GitOps Strategy

Define and enforce platform standards across the delivery lifecycle

GitOps tooling: ArgoCD, Fleet, declarative cluster management

CI/CD pipelines: Azure DevOps, Jenkins (build, test, promote)

Infrastructure as Code: Terraform (modules, remote state, workspaces), Ansible

Standardize cluster bootstrapping, app deployment lifecycle, and environment promotion (Dev, QA, Prod)

3. AI / GPU Infrastructure Architecture (Priority Competency)

Design and operate GPU compute platforms at scale:

GPU Operator deployment and lifecycle management

MIG (Multi-Instance GPU) partitioning for multi-tenant workloads

Advanced scheduling: Run:AI, Kubernetes-native GPU scheduling (device plugins)

Understand AI workload classes and their infrastructure implications:

Distributed training workloads (data/model/pipeline parallelism)

Inference pipelines: NVIDIA Triton Inference Server, TensorRT optimization

Align infrastructure to the full AI stack:

CUDA stack, cuDNN, NCCL collective communication libraries

High-speed networking: InfiniBand (HDR/NDR), RoCE for RDMA

GPUDirect RDMA / GPUDirect Storage for low-latency data paths

4. Observability & Reliability Engineering

Define and implement full-stack observability:

Metrics: Prometheus, Thanos (long-term retention, multi-cluster)

Logs: Loki, Fluent Bit

GPU telemetry: DCGM Exporter, NVIDIA Nsight Systems

Build operational frameworks:

SLO / SLA definitions and error budget tracking

Alerting strategy: noise reduction, severity routing

Incident response playbooks and on-call runbooks

5. Security & Multi-Tenancy Architecture

Design zero-trust security postures for multi-tenant platforms

Secret management: HashiCorp Vault, External Secrets Operator

Identity and access: IAM, RBAC, SSO/OIDC integration

Network isolation: NetworkPolicy, micro-segmentation, mTLS

Secure GPU sharing: MIG isolation, vGPU licensing, tenant boundary enforcement

6. HPC Data & Storage Architecture (Priority Competency)

Understand high-performance storage for AI/HPC workloads:

GPUDirect Storage: bypassing the CPU for GPU-native I/O

Distributed file systems: Weka (high-throughput NFS/S3), Ceph (scalable object/block)

Storage tiering, caching strategies, and data lifecycle management

Size and validate storage architectures against workload I/O profiles

7. Operational Leadership & Linux Systems

Lead incident response and root cause analysis (RCA) for critical production issues

Define upgrade strategies, change management procedures, and disaster recovery plans

Write and maintain runbooks, operational playbooks, and knowledge base content

Integrate organizational processes, compliance requirements, and security policies into operational frameworks

Deep Linux expertise:

Kernel tuning (CPU governor, NUMA, IRQ affinity, hugepages)

Storage I/O scheduling, NVMe optimization

Network stack tuning for RDMA / InfiniBand

System performance profiling and bottleneck analysis

Candidate Profile: Who You Are

You are comfortable running production systems

You have stronger SysOps and HPC depth than DevOps breadth, and you embrace that identity

You can shift fluidly between running a live incident, presenting an architecture to a CTO, and reviewing a POC demo environment

You communicate technical complexity clearly to engineers and to C-level stakeholders

You understand why specific tooling choices matter (not just how to configure them) and can articulate trade-offs in presales conversations

You are comfortable owning outcomes across both commercial (presales) and delivery (operations) dimensions

You thrive in ambiguity and can scope both short POCs and long-horizon platform programs



Requirements

Required

10 years in platform/infrastructure engineering, with at least 2 years in an architect-level role

Proven hands-on experience operating Kubernetes at scale in production (multi-cluster, multi-tenant)

Significant Linux systems administration experience: kernel, networking, and storage at a low level

HPC and/or GPU infrastructure experience: physical GPU servers, NCCL, InfiniBand, or high-speed fabrics

Demonstrable presales or client-facing experience

IaC experience: Terraform and/or Ansible in production environments

Strong understanding of GitOps and CI/CD pipelines in enterprise settings

Strongly Preferred

Experience with NVIDIA GPU Operator, MIG partitioning, Run:AI, or equivalent GPU scheduling tooling

Knowledge of distributed AI training infrastructure (PyTorch DDP, Horovod, DeepSpeed) from an infrastructure perspective

Familiarity with NVIDIA Triton Inference Server or TensorRT deployment pipelines

Experience with Weka, Ceph, or GPUDirect Storage in HPC/AI environments

Hands-on experience with Vault, External Secrets, and zero-trust network architectures

Exposure to bare-metal provisioning and HPC cluster management (Slurm, PBS, or equivalent)

Certifications (Advantageous)

CKA / CKS (Certified Kubernetes Administrator / Security Specialist)

RHCE / RHCA (Red Hat Certified Engineer / Architect)

AWS Solutions Architect / Azure Solutions Architect Expert

HashiCorp Terraform Associate or Vault Associate

NVIDIA DLI certifications (GPU computing, AI infrastructure)


Required Experience:

Senior IC


About the Company


Integrant, Inc. is a custom software development company focused on providing tailor-made software solutions that fit your needs to a tee. We strive to uncover your pain points and identify how our team can seamlessly integrate with you and your business for a one-team approach.
