AI DevOps Engineer

Anthrobyte

Job Location:

Hyderabad - India

Monthly Salary: Not Disclosed

Experience Required: 4-5 years

Posted on: 2 hours ago

Vacancies: 1 Vacancy

Job Summary

01 THE OPPORTUNITY

A founding infrastructure leadership role.

Anthrobyte builds enterprise AI systems that move organisations from pilot to production-grade adoption. As we scale, we need a platform engineer who can own the full infrastructure vision: model serving, MLOps pipelines, GPU cluster management, observability, and cloud architecture, all as one coherent production-grade system.

This is less a traditional DevOps role and more a founding platform seat. You will work directly with AI engineers, product leadership, and enterprise clients to define how AI transformation is deployed, scaled, and made reliable inside complex organisations. You will be the person who makes AI products real.

HORIZON ONE: FOUNDING MANDATE

AI DevOps Engineer

Own the full platform layer (AI infrastructure, MLOps pipelines, model serving, observability, and cloud architecture) as one coherent production-grade system.

GROWTH TRACK: MERIT-BASED

Lead/Architect/Head of Platform Engineering

Grow into engineering leadership: shaping platform strategy, building and mentoring a DevOps team, and defining how AI infrastructure scales with Anthrobyte's client portfolio.

Requirements

02 RESPONSIBILITIES

Own the platform. Power the transformation.

Hyperscale AI Infrastructure

Design and operate AI infrastructure across multi-cloud environments (AWS, GCP, Azure) supporting LLM inference, fine-tuning, and RAG pipelines at production scale

Architect GPU cluster management and optimise inference throughput using vLLM, Triton Inference Server, TensorRT, or equivalent serving frameworks

Own infrastructure-as-code (Terraform, Pulumi): reproducible, version-controlled, disaster-recovery-ready environments across all deployments

Drive multi-region high-availability architecture decisions that reflect the reliability standards enterprise clients require

MLOps & Model Lifecycle

Build and maintain MLOps platforms: model versioning, experiment tracking, automated retraining, and deployment pipelines using MLflow, Kubeflow, or equivalent

Implement CI/CD pipelines (GitHub Actions, ArgoCD, Tekton) that support rapid model iteration without sacrificing production stability

Define promotion workflows from development to staging to production, with rollback, canary, and blue-green strategies as standard practice

Kubernetes & Container Orchestration

Lead container orchestration at scale: Kubernetes (EKS/GKE/AKS), Helm charts, service mesh configuration, and auto-scaling strategies for variable AI workloads

Configure GPU node pools, resource quotas, taints and tolerations, and network policies for secure, efficient AI workload scheduling

Own production incident response from detection through resolution to post-mortem and systemic fix

Observability & FinOps

Own observability end-to-end: latency, GPU utilisation, cost-per-inference, model drift detection, and SLO/SLA dashboards (Prometheus, Grafana, or equivalent)

Lead FinOps strategy for GPU compute: spot instance management, reserved capacity planning, and cost attribution across teams and client engagements

Surface infrastructure cost and reliability data to leadership and clients in clear, actionable terms

Security, Governance & Compliance

Enforce security and data governance standards across AI deployments: access controls, audit logging, secret management, and PII handling in inference pipelines

Support enterprise client compliance requirements, including data residency, model access controls, and audit trail documentation

Cross-Functional Partnership

Translate AI engineer and product requirements into platform specifications, and push back with alternatives when requirements are unrealistic or unsafe

Partner with client engineering teams during enterprise AI deployments, acting as the technical infrastructure authority

The Growth Pathway

Demonstrate consistent excellence as a platform engineering leader and the scope expands. You will grow into Lead/Architect/Head of Platform Engineering: building and mentoring a DevOps and MLOps team, shaping infrastructure strategy across Anthrobyte's full client portfolio, and defining what production-grade enterprise AI deployment looks like at scale. This is not a title; it is a level of ownership that must be earned and continually re-earned.

03 WHO YOU ARE

The profile we are searching for.

You think across the full stack: from YAML to architecture, from GPU cost to enterprise reliability SLAs. You are the kind of engineer who has felt the weight of a production incident at 2am and built the systems that prevent the next one. You are as comfortable presenting infrastructure trade-offs to a CTO as you are deep in a Terraform module.

You bring:

4-6 years in DevOps or platform engineering, with at least 1-2 years specifically in AI or ML infrastructure

Demonstrated hyperscale experience: infrastructure supporting millions of daily requests, petabyte-scale data, or multi-region distributed systems

Deep Kubernetes expertise: GPU node pools, resource quotas, network policies, and production incident ownership

Hands-on production experience with at least one major LLM serving stack (vLLM, Triton, TGI, Ray Serve, or BentoML)

Strong Python and scripting capability: you write automation, not just configuration

Proficiency with IaC (Terraform preferred) and GitOps workflows as standard practice

Cloud practitioner depth in AWS, GCP, or Azure, particularly compute, networking, and storage for AI workloads

Clear communicator, able to translate infrastructure complexity into language that resonates with engineers, product leads, and enterprise clients alike

Bonus signals:

GPU FinOps

Enterprise AI Governance

Open-Source Contributions

Edge / On-Premise LLM Serving

AI Consultancy Experience

Regulated Industry Deployments

Multi-Cloud Architecture

MLflow or Kubeflow Ownership

Startup 0-to-1 Environment

04 WHAT WE OFFER

A rare kind of opportunity.

Founding platform engineering ownership: greenfield infrastructure built your way, with your architectural decisions

Direct access to engineering and product leadership from day one

The mandate to build the platform function the way it should be built: AI-native, observable, and enterprise-reliable

Active involvement in enterprise AI deployment engagements: real infrastructure challenges, real clients, real consequences

Access to GPU compute resources, premium cloud credits, and AI tooling subscriptions

Competitive compensation benchmarked to senior engineering market rates in Hyderabad

A culture where great infrastructure work is visible and celebrated, not invisible

Required Education:

IIT, BITS Pilani, B.Tech, B.E., NIT
