AI Inference Engineer



Job Location: Johannesburg - South Africa
Monthly Salary: Not Disclosed
Posted on: 17 hours ago
Vacancies: 1

Job Summary

Job Description: AI Inference Engineer (vLLM and Kubernetes)

1. Role Overview & Strategic Context

1.1 Company Overview & Mission

We are architecting the future of intelligent enterprise solutions globally. As we move into 2026, our mission has evolved: not only to lead in digital transformation, but to pioneer the deployment of sovereign, high-performance Artificial Intelligence. We believe that the true power of AI lies in its accessibility and operational efficiency. By leveraging cutting-edge open-source innovation and enterprise-grade infrastructure, we are building the platforms that will power the next generation of automated intelligence for our clients and our internal operations.

Our engineering culture is rooted in the principles of DevOps, SRE, and radical automation. We value engineers who are not merely operators but architects of efficiency, who take immense pride in the stability, security, and performance of the systems they build. In an era where GPU resources are the new gold, our goal is to achieve world-class inference density and latency through meticulous systems engineering.

1.2 Position Snapshot

Job Title: AI Inference Engineer (vLLM and Kubernetes)

Location: Remote

Experience Level: Senior (5 Years in Systems/DevOps Engineering)

Language Requirements: Portuguese and English

Core Technical Pillar: RHEL, vLLM, NVIDIA GPU Operator, Kubernetes/OpenShift

1.3 Role Summary

The AI Inference Engineer (vLLM and Kubernetes) is a critical, highly specialized role at the vanguard of the modern MLOps landscape. This position is designed for a senior engineer who combines deep Red Hat Enterprise Linux (RHEL) systems administration expertise with modern AI infrastructure knowledge. Unlike traditional AI roles that focus on model training, your focus will be the last mile of AI: engineering the high-performance inference platforms that serve Large Language Models (LLMs) to end users at scale.

Your primary objective is to build, secure, and automate an enterprise-grade inference environment. This involves orchestrating vLLM (or equivalent high-throughput engines) within Kubernetes clusters, ensuring that GPU resources are utilized at peak efficiency. You will be expected to treat infrastructure as code, using Ansible to manage the foundational RHEL layer and Python to glue together complex AI workflows.
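To make the Python "glue" expectation concrete, here is a minimal sketch of the kind of service this role might own: a FastAPI endpoint that forwards prompts to a vLLM OpenAI-compatible completions server. The endpoint URL and model name are illustrative assumptions, not details from this posting.

    # Hypothetical glue service: accepts a prompt and forwards it to a
    # vLLM OpenAI-compatible server. URL and model name are placeholders.
    from fastapi import FastAPI
    from pydantic import BaseModel
    import httpx

    VLLM_URL = "http://vllm.internal:8000/v1/completions"  # assumed endpoint

    app = FastAPI()

    class PromptRequest(BaseModel):
        prompt: str
        max_tokens: int = 256

    @app.post("/generate")
    async def generate(req: PromptRequest) -> dict:
        payload = {
            "model": "example-model",  # placeholder model name
            "prompt": req.prompt,
            "max_tokens": req.max_tokens,
        }
        async with httpx.AsyncClient(timeout=60.0) as client:
            resp = await client.post(VLLM_URL, json=payload)
            resp.raise_for_status()
            data = resp.json()
        # The OpenAI-compatible completions API returns choices[0].text
        return {"completion": data["choices"][0]["text"]}

Run with any ASGI server (for example, uvicorn) and point VLLM_URL at a real inference service.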

Strategic Context:

In 2026, the competitive advantage of an organization is defined by its Inference Velocity. Our goal is to reduce the cost of intelligence while maximizing output quality. We are looking for a proactive engineer who does not wait for a system alert to identify inefficiencies, but actively hunts for bottlenecks in KV cache management, continuous batching performance, and GPU scheduling to refine our competitive edge.

1.4 Key Objectives for the Role

Infrastructure Reliability: Ensure 99.99% availability of our LLM inference endpoints through robust Kubernetes orchestration on RHEL nodes.

Operational Excellence: Implement end-to-end automation with Ansible to ensure that a new GPU-enabled node can be provisioned, hardened, and added to the cluster with zero manual intervention.

Cost Efficiency: Monitor and optimize GPU utilization (NVIDIA/AMD) to ensure we are achieving the highest possible tokens-per-second.

Security & Compliance: Harden the RHEL environment and container stack to meet the evolving POPIA and international data security standards relevant in 2026.

2. Key Responsibilities

The AI Inference Engineer is tasked with the end-to-end lifecycle of high-performance model serving. This role bridges the gap between raw machine learning models and production-grade software services, ensuring that our AI capabilities are delivered with sub-second latency and enterprise-level reliability on a Red Hat Enterprise Linux (RHEL) foundation.

2.1 Inference Platform Engineering

Architectural Design: Design and deploy scalable inference architectures using vLLM within Kubernetes and Red Hat OpenShift environments to support Large Language Model (LLM) serving.

GPU Orchestration: Install, configure, and maintain the NVIDIA GPU Operator to manage GPU resources, ensuring seamless hardware acceleration for containerized workloads.

Memory Optimization: Implement and tune advanced memory management techniques, including PagedAttention, to optimize KV cache efficiency and maximize GPU memory utilization.

Throughput Enhancement: Configure and manage continuous batching and speculative decoding strategies to increase inference throughput and reduce Time Per Output Token (TPOT).

Model Quantization: Collaborate with data science teams to implement and deploy quantized models (AWQ, GPTQ, FP8) to balance hardware requirements with inference precision (see the configuration sketch after this list).
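As an illustration of the tuning surface named above, the following minimal sketch configures vLLM's offline engine with the knobs this role works with daily: the KV cache budget behind PagedAttention (gpu_memory_utilization), the continuous-batching ceiling (max_num_seqs), and a pre-quantized AWQ checkpoint. The model name and values are assumptions for the example, not prescribed settings.

    # Minimal vLLM configuration sketch; values are illustrative only.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TheBloke/Llama-2-13B-AWQ",  # hypothetical AWQ-quantized checkpoint
        quantization="awq",                # serve pre-quantized weights
        gpu_memory_utilization=0.90,       # VRAM fraction reserved; the bulk feeds the KV cache
        max_num_seqs=128,                  # upper bound on continuously batched sequences
        tensor_parallel_size=1,            # single-GPU serving in this sketch
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
    print(outputs[0].outputs[0].text)

In production the same knobs are typically passed to the vLLM server process rather than the offline LLM class, but the trade-off (cache headroom versus batch ceiling) is identical.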

2.2 Automation & MLOps Infrastructure

Infrastructure as Code (IaC): Develop and maintain complex Ansible playbooks for the automated provisioning and lifecycle management of RHEL nodes specifically tailored for GPU-accelerated workloads.

Node Hardening: Automate the installation of NVIDIA drivers, CUDA toolkits, and container runtimes (CRI-O/Docker) across the RHEL fleet.

Kernel Tuning: Perform low-level RHEL system tuning, including HugePages configuration and CPU pinning, to support high-demand AI workloads (see the pre-flight sketch after this list).

CI/CD Pipeline Development: Build and optimize Jenkins and GitLab CI pipelines for automated model containerization, versioning, and deployment using GitOps methodologies.

Environment Standardization: Ensure parity between development, staging, and production environments across hybrid-cloud footprints (on-premise KVM/VMware and public cloud).
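As one illustration of the zero-manual-intervention goal, a provisioning playbook might end by invoking a pre-flight check like the sketch below, which verifies that HugePages are configured and that the NVIDIA driver can enumerate at least one GPU on a freshly built RHEL node. The checks and thresholds are assumptions, not an excerpt from our tooling.

    # Hypothetical node pre-flight check for a GPU-enabled RHEL host.
    import re
    import shutil
    import subprocess
    import sys

    def hugepages_total() -> int:
        """Read the configured HugePages count from /proc/meminfo."""
        with open("/proc/meminfo") as fh:
            for line in fh:
                match = re.match(r"HugePages_Total:\s+(\d+)", line)
                if match:
                    return int(match.group(1))
        return 0

    def gpus_visible() -> bool:
        """Confirm the NVIDIA driver is present and lists at least one GPU."""
        if shutil.which("nvidia-smi") is None:
            return False
        result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
        return result.returncode == 0 and "GPU" in result.stdout

    if __name__ == "__main__":
        healthy = True
        if hugepages_total() == 0:
            print("WARN: no HugePages configured")
            healthy = False
        if not gpus_visible():
            print("FAIL: NVIDIA driver/GPU not visible")
            healthy = False
        sys.exit(0 if healthy else 1)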

Operational Excellence Callout:

A primary accountability of this role is the elimination of technical toil. You will be expected to proactively identify manual processes in the model deployment lifecycle and engineer automated, self-healing solutions that reduce the Time-to-Inference for new model versions.

2.3 Performance Monitoring & Reliability Engineering

Benchmarking: Conduct rigorous performance benchmarking for LLM endpoints, tracking critical metrics such as Time to First Token (TTFT) and total requests per second (see the benchmark sketch after this list).

GPU Telemetry: Implement advanced monitoring solutions using Prometheus and Grafana to track real-time GPU utilization, temperature, and memory bandwidth (utilizing NVIDIA DCGM).

High Availability: Design and test failover strategies and load balancing for inference services to ensure uninterrupted AI availability during hardware failures or maintenance.

Capacity Planning: Analyze usage trends to provide data-driven recommendations for GPU infrastructure scaling and cloud resource allocation.
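As a sketch of the benchmarking described above, the script below measures TTFT and approximate token throughput against an OpenAI-compatible streaming endpoint such as the one vLLM exposes. The URL and model name are placeholders, and counting one streamed chunk as one token is a simplifying assumption.

    # Rough TTFT/throughput probe for an OpenAI-compatible streaming endpoint.
    import time
    import requests

    URL = "http://vllm.internal:8000/v1/completions"  # assumed endpoint

    def benchmark(prompt: str, max_tokens: int = 256) -> None:
        payload = {
            "model": "example-model",  # placeholder model name
            "prompt": prompt,
            "max_tokens": max_tokens,
            "stream": True,
        }
        start = time.perf_counter()
        first_token_at = None
        tokens = 0
        with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if not line or not line.startswith(b"data: "):
                    continue  # skip keep-alives and non-data lines
                if line[len(b"data: "):] == b"[DONE]":
                    break
                if first_token_at is None:
                    first_token_at = time.perf_counter()
                tokens += 1  # approximation: one SSE chunk ~ one token
        elapsed = time.perf_counter() - start
        ttft = (first_token_at or start) - start
        print(f"TTFT: {ttft * 1000:.1f} ms, ~{tokens / elapsed:.1f} tokens/s")

    benchmark("Summarize the role of continuous batching in vLLM.")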

2.4 Advanced RHEL Administration & Security

OS Lifecycle Management: Manage RHEL system updates and patch cycles in coordination with Kubernetes node drains to ensure zero-downtime maintenance.

Container Security: Enforce rigorous container security standards, including the use of rootless Podman where applicable, and manage security contexts in Kubernetes.

Compliance Hardening: Utilize OpenSCAP to ensure all AI nodes comply with enterprise security baselines and data sovereignty requirements.

Access Control: Implement and manage SELinux policies to provide fine-grained mandatory access control for inference processes and model weights.

3. Required Technical Skills & Qualifications

The AI Inference Engineer is a specialized role requiring a convergence of deep systems engineering, modern DevOps practices, and high-performance machine learning operationalization. Candidates must demonstrate not only the ability to manage these technologies individually but also the architectural insight to integrate them into a seamless, high-throughput inference fabric.

3.1 Professional Experience

Core Tenure: A minimum of 5 years of professional experience in a senior DevOps, MLOps, SRE, or Systems Engineering role.

Domain Focus: At least 2 years of focused experience in AI infrastructure management, specifically large-scale model deployment or high-concurrency inference serving.

Production Track Record: Proven history of maintaining 99.9% availability for production-grade workloads in a containerized Linux environment.

3.2 Red Hat Enterprise Linux (RHEL) Mastery

The candidate must be an expert Linux practitioner capable of navigating the RHEL 8 and 9 ecosystems with precision.

System Internals: Deep knowledge of RHEL system administration, including systemd, firewalld, LVM, and advanced networking configuration (TCP/IP tuning for low latency).

Performance Tuning: Ability to perform low-level kernel tuning and use tools such as tuned-adm, top/htop, and perf to optimize RHEL nodes for GPU-intensive workloads.

Security & Compliance: Mastery of SELinux (policy management and troubleshooting) and experience using OpenSCAP for automated compliance auditing against enterprise security baselines.

3.3 Containerization & Orchestration

Kubernetes / OpenShift: Expert-level proficiency in Kubernetes administration, specifically with Red Hat OpenShift, including custom resource definitions (CRDs), operators, and complex ingress controllers.

GPU-Aware Scheduling: Production experience with the NVIDIA GPU Operator and NVIDIA device plugins, and with managing Multi-Instance GPU (MIG) profiles.

Runtime Environments: Deep understanding of CRI-O or Docker runtimes and their interactions with the NVIDIA Container Toolkit.

CRITICAL REQUIREMENT: vLLM Production Experience

The 2026 landscape necessitates specialized inference engines. Candidates must have hands-on, production-level experience deploying and tuning vLLM. You should be able to explain and implement PagedAttention, continuous batching configurations, and KV cache sizing to maximize tokens-per-second throughput.
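To illustrate the kind of KV cache sizing reasoning this requirement refers to, here is a back-of-envelope calculation assuming a Llama-2-13B-like architecture (40 layers, 40 KV heads, head dimension 128, fp16 cache). The figures are illustrative, not a statement about any specific deployment.

    # Back-of-envelope KV cache sizing under the stated assumptions.
    def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
        # K and V each hold kv_heads * head_dim elements per layer per token
        return 2 * layers * kv_heads * head_dim * dtype_bytes

    per_token = kv_bytes_per_token(layers=40, kv_heads=40, head_dim=128, dtype_bytes=2)
    print(f"{per_token / 2**20:.2f} MiB per token")  # ~0.78 MiB

    # Tokens that fit in 20 GiB of VRAM reserved for the KV cache:
    budget = 20 * 2**30
    print(f"~{budget // per_token:,} cacheable tokens in 20 GiB")

Numbers like these drive gpu_memory_utilization and batch-size decisions: the larger the reserved cache, the more concurrent sequences continuous batching can keep in flight.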

3.4 Automation & Scripting Proficiencies

Ansible Expertise: Advanced capability in writing modular Ansible roles and playbooks for infrastructure as code, including automating RHEL node setup, driver installations, and security patching.

Python Development: Strong proficiency in Python for building automation tools, glue code for AI APIs (FastAPI/Flask), and data processing scripts.

Shell Scripting: Mastery of Bash for rapid system-level troubleshooting and automation within the RHEL environment.

3.5 Cloud Platforms & Infrastructure

Public Cloud Mastery: Proven experience managing GPU-accelerated instances on at least one major platform: AWS (P4/P5 instances), Azure (ND-series), or GCP (A3/H100 instances).

Hybrid Cloud Networking: Understanding of VPC/VNet configurations, Direct Connect/ExpressRoute, and high-performance networking fabrics (EFA/InfiniBand).

Virtualization: Familiarity with KVM or VMware ESXi environments for hosting private-cloud AI workloads.

3.6 Educational Background & Certifications

A Bachelor's degree in Computer Science, Information Technology, or a related Engineering field is preferred. In the absence of a degree, significant, verifiable industry experience (8 years) will be considered.

Desired credentials:

Linux Systems: Red Hat Certified Engineer (RHCE) or Red Hat Certified Architect (RHCA)

Cloud Orchestration: Certified Kubernetes Administrator (CKA) or Red Hat Certified Specialist in OpenShift

Cloud Provider: AWS Certified Solutions Architect (Professional) or Azure Solutions Architect Expert

4. Desired Soft Skills & Professional Attributes

Technical mastery is the foundation of this role, but professional excellence at Company Name is defined by the non-technical attributes that enable a Senior Engineer to drive organizational impact. In the high-stakes, rapidly evolving 2026 AI landscape, we seek a leader who combines technical depth with the cognitive agility and interpersonal skills required to navigate complex infrastructure challenges.

4.1 Proactive Problem-Solver & Strategic Foresight

Anticipatory Engineering: You do not wait for a monitoring alert to trigger a response. You proactively hunt for system inefficiencies, latent bottlenecks in GPU memory allocation, or drift in inference latency before they impact the end-user experience.

Root Cause Obsession: When an incident occurs, you possess the tenacity to look beyond the immediate fix. You strive to understand the underlying architectural or systemic flaws and engineer permanent, automated remediations.

Operational Intuition: Ability to foresee how changes in model architecture or vLLM configuration will ripple through the Kubernetes cluster, preventing cascading failures through careful planning and testing.

4.2 Analytical Mindset & Data-Driven Decision Making

Metrics-First Approach: You rely on hard data, such as GPU utilization percentages, Time to First Token (TTFT), and throughput benchmarks, to justify architectural changes or infrastructure investments.

Cost Optimization Focus: In a high-cost GPU environment, you possess a keen eye for fiscal efficiency. You use analytical models to balance performance requirements with the economic realities of cloud and on-premise resource consumption.

Evidence-Based Troubleshooting: Ability to synthesize information from disparate logs, telemetry streams, and kernel traces to build a clear, evidence-based narrative of system behavior.

Professional Attribute Focus: The Platform as a Product Mindset

Success in this role requires treating the AI Inference Platform not just as a set of servers but as a product delivered to our internal Data Science and Machine Learning teams. You must demonstrate the empathy to understand their needs and the professional rigor to deliver a platform that is reliable, performant, and easy to use.

4.3 Ownership Accountability & Resilience

Extreme Ownership: You view the reliability and performance of the inference platform as your personal responsibility. You are a self-starter who identifies a gap and takes the initiative to close it without needing explicit instruction.

Reliability Under Pressure: You maintain a calm, methodical approach during high-pressure system outages or critical deployment windows, serving as a stabilizing force for the wider engineering team.

High Standards: A commitment to excellence in everything from code quality in Ansible playbooks to the clarity of technical documentation.

4.4 Collaborative Spirit & Technical Diplomacy

Cross-Functional Partnership: Ability to work effectively with Data Scientists, ML Engineers, and Platform Architects. You can translate complex, low-level infrastructure constraints into meaningful, high-level impact statements for your peers.

Mentorship and Knowledge Sharing: You take pride in uplifting the team's collective intelligence by documenting your findings, conducting peer reviews, and mentoring junior engineers in RHEL and Kubernetes best practices.

Conflict Resolution: Capability to navigate technical disagreements with a focus on objective outcomes and professional respect, ensuring the best architectural path is chosen for the organization.

4.5 Adaptability & Continuous Technical Evolution

Lifelong Learner: You have a genuine passion for the bleeding edge. You actively follow the evolution of the vLLM project, Kubernetes enhancements, and Red Hat's AI roadmap to ensure our stack remains modern and competitive.

Intellectual Flexibility: The 2026 AI landscape changes weekly. You are comfortable pivoting your technical strategy when a more efficient tool, model format, or optimization technique emerges in the community.

Context Awareness: Understanding of the unique challenges and opportunities within the local tech ecosystem, including power constraints and data sovereignty requirements.


