Hello,
Infrastructure Engineer - Software Engineer Infrastructure & Hardware Optimization - Remote
We have the job opening below.
If you are interested and your experience matches the job description,
please send your updated resume.
Software Engineer Infrastructure & Hardware Optimization
Location: SF, CA; Portland, OR; Dallas, TX (remote, but candidates must be local to one of these locations)
Duration: 6-month contract
Job Description: We are seeking a skilled low-level systems engineer to join the team. This individual will focus on infrastructure software that detects, configures, and optimizes AI inference pipelines across heterogeneous hardware accelerators (e.g., NVIDIA/AMD GPUs, TPUs, AWS Inferentia, FPGAs). You will work on hardware abstraction layers, containerized runtime environments, benchmarking, telemetry, and driver orchestration logic for multi-cloud agentic inference deployments.
Ideal Experience:
4-7 years of experience in systems software or infrastructure engineering, preferably with exposure to AI/ML workloads.
Deep expertise in CUDA, NCCL, ROCm, or other accelerator programming frameworks.
Familiarity with LLM inference runtimes (TensorRT-LLM, vLLM, ONNX Runtime).
Experience with Kubernetes scheduling, device plugin development, and runtime patching for heterogeneous compute.
Strong Python/C and Linux systems programming skills.
Passion for building scalable, portable, and secure AI infrastructure.
Responsibilities:
Design and implement cross-platform hardware detection systems for GPUs/TPUs/NPUs using CUDA, ROCm, and low-level runtime interfaces.
Build and maintain plugin-based infrastructure for capability scoring, power efficiency tuning, and memory optimization.
Develop hardware abstraction layers (HAL) and performance benchmarking tools to optimize AI agents for cloud-native inference.
Extend container-based MLOps systems (Docker/Kubernetes) with support for hardware-specific runtime containers (e.g., TensorRT, vLLM, ROCm).
Automate driver validation, container security hardening, and runtime health monitoring across deployments.
Integrate telemetry systems (Prometheus, Grafana) to surface per-device inference performance metrics and health status.
Collaborate with solutions and DevOps teams to ensure hardware-aware agent deployment across cloud providers.
Additional Information:
All your information will be kept confidential according to EEO guidelines.
Remote Work:
Yes
Employment Type:
Contract