AI Infrastructure Architect

Bengaluru - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Description

What you will do (Key responsibilities)

1) Architect and deliver customer AI infrastructure (end-to-end)

Lead architecture and implementation for secure, scalable AI/ML/LLM platforms based on customer requirements and constraints.
Produce implementation-ready artifacts: HLD/LLD, reference architectures, network/topology diagrams, deployment plans, runbooks, and operational handover packs.
Translate business and technical requirements into a scalable target state and guide delivery teams through build, rollout, and production readiness.

2) Solve real enterprise constraints (network, access, topology)

Design enterprise network topologies with segmentation/isolation: private subnets, route tables, security policies, egress control, private endpoints, and controlled ingress patterns.
Work within common enterprise constraints:
- Fixed network address plans (pre-approved CIDR ranges), IP allowlists/deny-lists, and limited routing flexibility
- Private connectivity requirements (VPN/Direct Connect/FastConnect/ExpressRoute), no public endpoints, and restricted DNS resolution
- Controlled administrative access (bastion/jump host, privileged access management, session recording, time-bound access)
- Restricted egress (proxy-only outbound, firewall-controlled destinations, egress allowlists, DNS filtering, no direct internet)
- Customer-managed encryption and key custody (KMS/HSM, BYOK/HYOK, key rotation, certificate lifecycle)
- Strict TLS policies (mTLS, approved ciphers, enterprise PKI, certificate pinning where required)
- Identity and access controls (SSO/SAML/OIDC, RBAC/ABAC, least privilege, break-glass accounts)
- Data governance constraints (PII/PHI handling, residency/sovereignty, retention, audit evidence requirements)
- Secure software supply chain (approved base images, artifact signing, SBOM, vulnerability scanning, patch SLAs)
- Endpoint controls (EDR agents, OS hardening standards, restricted packages, golden images)
- Change management gates (CAB approvals, limited maintenance windows, separation of duties)
- Observability restrictions (logs cannot leave the tenant, redaction/masking, approved collectors/forwarders only)
- Multi-tenant isolation and policy boundaries (namespace isolation, network policies, runtime sandboxing)
- High availability & DR expectations (multi-zone patterns, backup/restore, failover runbooks, RTO/RPO)
Ensure secure data movement and integration patterns for AI workloads (east-west and north-south traffic).

3) Security-by-design, InfoSec approvals, and guardrails for AI platforms

Lead InfoSec engagement: threat modeling, control mapping, evidence collection, remediation plans, and security sign-offs for AI infrastructure.
Implement security controls and platform guardrails:
- TLS/SSL-only communication patterns; encryption-in-transit and encryption-at-rest
- API security: OAuth2/JWT/mTLS, gateway policies, request signing patterns where required
- Secrets management using vault/key management services, rotation, and lifecycle controls
- IAM and least-privilege access models; tenant/project isolation
- VM hardening (CIS-aligned baselines), patching strategy, secure images
- Kill switches / emergency stop mechanisms for agents (tool-disable, egress cut-off, policy stop, rollback runbooks)
- AI infra guardrails: controlled tool execution, outbound allowlists, boundary policies, audit-ready logging

4) LLM hosting, GPU infrastructure, and scale

Architect LLM hosting patterns: managed endpoints, self-hosted inference, multi-model routing, and workload isolation.
Design and operationalize GPU-based inference at scale:
- Capacity planning, GPU node pools, scaling policies, cost/performance optimization
- Performance profiling and reliability patterns for inference services
Build container/Kubernetes-based AI platforms (OKE/EKS/AKS/GKE as applicable):
- Secure cluster designs, namespaces/tenancy, node isolation, secrets, and safe rollout strategies
- Support AI frameworks and application runtimes on Kubernetes for scale and portability

5) Observability, reliability engineering, and operational readiness

Define and implement observability across AI systems:
- Metrics, logs, traces, audit trails, and network call tracing
- Integration with enterprise observability tools (customer standard platforms)
Define SLIs/SLOs for AI services:
- Latency, throughput, error rates, saturation, GPU utilization, queue depth, retry behavior
Execute load testing and capacity validation for inference endpoints, vector stores, agent runtimes, and integration services.
Build reliable ops workflows: incident response, runbooks, dashboards, alerting, and proactive health checks.

6) Disaster recovery and resilience for AI platforms

Design DR strategies for AI solutions:
- Multi-AD / multi-region patterns, backup/restore for critical stores, IaC-based rebuilds
- Failover runbooks, RTO/RPO alignment, and validation exercises
Ensure production-grade resilience and safe rollback for platform and application layers.

7) Red teaming and risk mitigation for AI infrastructure

Drive security validation for AI infrastructure and agent deployments:
- Attack surface review, secrets leakage paths, egress abuse scenarios
- Prompt/tool misuse impact assessment at infrastructure level
Implement mitigations and hardening measures with measurable controls.

8) Consulting leadership and stakeholder management

Act as a trusted technical advisor to customer platform, network, and security teams.
Communicate clearly with diverse stakeholders (CIO/CTO, Security, Infra, App teams) and drive decisions under ambiguity.
Mentor engineers/architects, conduct design reviews, and build reusable delivery accelerators and blueprints.

Responsibilities

Required experience and qualifications

15 years of experience in infrastructure architecture, cloud engineering, or platform consulting, with proven ownership of end-to-end architecture and delivery.
Strong fundamentals in networking, operating systems, distributed systems, and enterprise security.
Proven experience delivering secure, highly available platforms in regulated or enterprise environments.
Deep hands-on experience with:
- Cloud infrastructure (OCI preferred; AWS/Azure/GCP acceptable)
- Enterprise network design (VPC/VCN, VPNs, routing, firewalls, proxies, private endpoints, DNS)
- Kubernetes/container platforms (OKE/EKS/AKS/GKE), secure cluster patterns, and scaling strategies
- Infrastructure-as-Code (Terraform strongly preferred) and automation (Python/shell)
- Observability stacks (logs/metrics/traces) and integration with enterprise monitoring tools
- IAM, vault/key management, secrets handling, encryption standards, and audit controls
Strong customer-facing skills: requirements discovery, architecture documentation, and delivery leadership.

Preferred (nice-to-have) skills

LLM inference serving (open models and/or managed endpoints), multi-model routing, and AI workload isolation.
GPU platform engineering: scheduling, node pool design, performance tuning, and cost controls.
Experience implementing agentic AI runtime patterns with safe tool execution and enterprise guardrails.
Hybrid and multi-cloud deployments, including on-prem connectivity and enterprise integration patterns.
Familiarity with data platforms relevant to AI (vector stores, metadata stores, object storage patterns).

Core competencies (what we value)

Systems thinking and security-first architecture mindset
Strong problem solving in constrained enterprise environments
Crisp documentation and executive-ready communication
Hands-on delivery orientation (not just advisory)
Ownership, urgency, and accountability for production outcomes

Scope and impact (IC4 expectations)

Independently leads complex customer AI infrastructure programs from discovery through production and handover.
Unblocks security/network constraints and drives approvals with clear evidence and mitigations.
Establishes reusable, referenceable blueprints (secure AI landing zones, LLM hosting patterns, DR templates, observability baselines).
Raises the quality bar by mentoring teams and institutionalizing guardrails, reliability practices, and delivery accelerators.

Qualifications

Career Level - IC4

Required Experience:

Staff IC
