AI Infrastructure Architect

Oracle


Job Location: Bengaluru - India

Monthly Salary: Not Disclosed
Posted on: 14 hours ago
Vacancies: 1

Job Summary

Description

What you will do (Key responsibilities)

1) Architect and deliver customer AI infrastructure (end-to-end)

  • Lead architecture and implementation for secure, scalable AI/ML/LLM platforms based on customer requirements and constraints.
  • Produce implementation-ready artifacts: HLD/LLD reference architectures, network/topology diagrams, deployment plans, runbooks, and operational handover packs.
  • Translate business and technical requirements into a scalable target state and guide delivery teams through build, rollout, and production readiness.

2) Solve real enterprise constraints (network, access, topology)

  • Design enterprise network topologies with segmentation/isolation: private subnets, route tables, security policies, egress control, private endpoints, and controlled ingress patterns.
  • Work within common enterprise constraints:
    • Fixed network address plans (pre-approved CIDR ranges), IP allowlists/deny-lists, and limited routing flexibility (see the address-plan sketch after this list)
    • Private connectivity requirements (VPN/Direct Connect/FastConnect/ExpressRoute), no public endpoints, and restricted DNS resolution
    • Controlled administrative access (bastion/jump host, privileged access management, session recording, time-bound access)
    • Restricted egress (proxy-only outbound, firewall-controlled destinations, egress allowlists, DNS filtering, no direct internet)
    • Secure data movement and integration patterns for AI workloads (east-west and north-south traffic)
    • Customer-managed encryption and key custody (KMS/HSM, BYOK/HYOK, key rotation, certificate lifecycle)
    • Strict TLS policies (mTLS, approved ciphers, enterprise PKI, certificate pinning where required)
    • Identity and access controls (SSO/SAML/OIDC, RBAC/ABAC, least privilege, break-glass accounts)
    • Data governance constraints (PII/PHI handling, residency/sovereignty, retention, audit evidence requirements)
    • Secure software supply chain (approved base images, artifact signing, SBOMs, vulnerability scanning, patch SLAs)
    • Endpoint controls (EDR agents, OS hardening standards, restricted packages, golden images)
    • Change management gates (CAB approvals, limited maintenance windows, separation of duties)
    • Observability restrictions (logs can't leave the tenant, redaction/masking, approved collectors/forwarders only)
    • Multi-tenant isolation and policy boundaries (namespace isolation, network policies, runtime sandboxing)
    • High availability and DR expectations (multi-zone patterns, backup/restore, failover runbooks, RTO/RPO)
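
For illustration only, a minimal sketch of the kind of address-plan check implied by the fixed-CIDR constraint above; the approved ranges and subnet names are hypothetical. It verifies that proposed subnets fall inside pre-approved ranges and do not overlap.

```python
import ipaddress

# Hypothetical pre-approved CIDR ranges handed down by the customer's network team.
APPROVED_RANGES = [ipaddress.ip_network("10.20.0.0/16"), ipaddress.ip_network("10.21.0.0/16")]

# Proposed subnets for the AI platform (example values only).
PROPOSED_SUBNETS = {
    "inference-private": ipaddress.ip_network("10.20.1.0/24"),
    "vector-store-private": ipaddress.ip_network("10.20.2.0/24"),
    "bastion": ipaddress.ip_network("10.99.0.0/28"),  # deliberately outside the plan
}

def validate_plan(subnets, approved):
    """Check that every proposed subnet fits an approved range and that none overlap."""
    problems = []
    for name, net in subnets.items():
        if not any(net.subnet_of(parent) for parent in approved):
            problems.append(f"{name} ({net}) is outside the pre-approved ranges")
    names = list(subnets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if subnets[a].overlaps(subnets[b]):
                problems.append(f"{a} overlaps {b}")
    return problems

if __name__ == "__main__":
    for issue in validate_plan(PROPOSED_SUBNETS, APPROVED_RANGES):
        print("PLAN ISSUE:", issue)
```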

3) Security-by-design: InfoSec approvals and guardrails for AI platforms

  • Lead InfoSec engagement: threat modeling, control mapping, evidence collection, remediation plans, and security signoffs for AI infrastructure.
  • Implement security controls and platform guardrails:
    • TLS/SSL-only communication patterns; encryption-in-transit and encryption-at-rest
    • API security: OAuth2/JWT/mTLS, gateway policies, and request signing patterns where required
    • Secrets management using vault/key management services, with rotation and lifecycle controls
    • IAM and least-privilege access models; tenant/project isolation
    • VM hardening (CIS-aligned baselines), patching strategy, and secure images
    • Kill switches / emergency stop mechanisms for agents (tool disable, egress cut-off, policy stop, rollback runbooks)
    • AI infra guardrails: controlled tool execution, outbound allowlists, boundary policies, and audit-ready logging (see the guardrail sketch after this list)
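
As a hedged illustration of the kill-switch and outbound-allowlist guardrails listed above, the sketch below uses hypothetical hostnames and a simple in-process flag; a real platform would back both with a policy service and emit audit logs. It refuses agent tool calls when the kill switch is engaged or the destination is not allowlisted.

```python
from urllib.parse import urlparse

# Hypothetical guardrail state; in a real platform this would live in a policy service.
EGRESS_ALLOWLIST = {"api.internal.example.com", "artifacts.internal.example.com"}
KILL_SWITCH_ENGAGED = False  # flipped by an operator runbook or an automated policy stop

class EgressBlocked(Exception):
    pass

def guarded_tool_call(url: str, tool_fn, *args, **kwargs):
    """Refuse agent tool calls when the kill switch is on or the destination is not allowlisted."""
    if KILL_SWITCH_ENGAGED:
        raise EgressBlocked("kill switch engaged: all agent tool calls are stopped")
    host = urlparse(url).hostname
    if host not in EGRESS_ALLOWLIST:
        raise EgressBlocked(f"egress to {host!r} is not on the outbound allowlist")
    # Audit-ready logging would be emitted here before the call is made.
    return tool_fn(url, *args, **kwargs)

if __name__ == "__main__":
    try:
        guarded_tool_call("https://untrusted.example.org/data", lambda u: u)
    except EgressBlocked as exc:
        print("BLOCKED:", exc)
```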

4) LLM hosting, GPU infrastructure, and scale

  • Architect LLM hosting patterns: managed endpoints, self-hosted inference, multi-model routing, and workload isolation.
  • Design and operationalize GPU-based inference at scale:
    • Capacity planning, GPU node pools, scaling policies, and cost/performance optimization (see the capacity sketch after this list)
    • Performance profiling and reliability patterns for inference services
  • Build container/Kubernetes-based AI platforms (OKE/EKS/AKS/GKE as applicable):
    • Secure cluster designs, namespaces/tenancy, node isolation, secrets, and safe rollout strategies
    • Support AI frameworks and application runtimes on Kubernetes for scale and portability
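
As a rough illustration of the capacity-planning work above, the sketch below estimates how many inference replicas a target token throughput needs at a given utilization headroom; the throughput figures are examples only, and real numbers come from profiling the chosen model and GPU SKU.

```python
import math

def required_gpu_replicas(target_tokens_per_s: float,
                          tokens_per_s_per_replica: float,
                          headroom: float = 0.7) -> int:
    """Back-of-the-envelope replica count: keep each replica below `headroom` utilization."""
    usable = tokens_per_s_per_replica * headroom
    return math.ceil(target_tokens_per_s / usable)

if __name__ == "__main__":
    # Example figures only; real values come from load-testing the deployed model.
    replicas = required_gpu_replicas(target_tokens_per_s=12_000,
                                     tokens_per_s_per_replica=1_500,
                                     headroom=0.7)
    print(f"Plan for roughly {replicas} inference replicas, plus spare capacity for zone failure.")
```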

5) Observability, reliability engineering, and operational readiness

  • Define and implement observability across AI systems:
    • Metrics, logs, traces, audit trails, and network call tracing
    • Integration with enterprise observability tools (customer-standard platforms)
  • Define SLIs/SLOs for AI services (see the SLI sketch after this list):
    • Latency, throughput, error rates, saturation, GPU utilization, queue depth, and retry behavior
  • Execute load testing and capacity validation for inference endpoints, vector stores, agent runtimes, and integration services.
  • Build reliable ops workflows: incident response runbooks, dashboards, alerting, and proactive health checks.
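
As a small illustration of the SLI/SLO work above, the sketch below computes p95 latency and error-rate SLIs for one evaluation window and compares them to targets; the samples are synthetic and the SLO thresholds are assumptions.

```python
import statistics

def latency_slis(latencies_ms, errors, slo_p95_ms=800.0, slo_error_rate=0.01):
    """Compute p95 latency and error-rate SLIs from raw samples and compare to SLO targets."""
    p95 = statistics.quantiles(latencies_ms, n=100)[94]  # 95th percentile cut point
    error_rate = errors / max(len(latencies_ms), 1)
    return {
        "p95_ms": p95,
        "p95_within_slo": p95 <= slo_p95_ms,
        "error_rate": error_rate,
        "error_rate_within_slo": error_rate <= slo_error_rate,
    }

if __name__ == "__main__":
    # Synthetic samples standing in for one evaluation window of inference requests.
    samples = [120, 180, 240, 310, 290, 950, 400, 260, 220, 1300, 310, 280] * 10
    print(latency_slis(samples, errors=3))
```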

6) Disaster recovery and resilience for AI platforms

  • Design DR strategies for AI solutions:
    • Multi-AD / multi-region patterns, backup/restore for critical stores, and IaC-based rebuilds
    • Failover runbooks, RTO/RPO alignment, and validation exercises (see the RPO check sketch after this section)
  • Ensure production-grade resilience and safe rollback for platform and application layers.
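
As a minimal illustration of RPO validation, the sketch below flags critical stores whose latest backup is older than the agreed RPO; the store names and timestamps are examples, and a real check would query the backup service.

```python
from datetime import datetime, timedelta, timezone

def rpo_breaches(last_backup_utc: dict, rpo: timedelta, now=None):
    """Flag critical stores whose most recent backup is older than the agreed RPO."""
    now = now or datetime.now(timezone.utc)
    return {name: now - ts for name, ts in last_backup_utc.items() if now - ts > rpo}

if __name__ == "__main__":
    # Example timestamps only; a real check would read them from the backup service.
    backups = {
        "vector-store": datetime.now(timezone.utc) - timedelta(hours=2),
        "model-registry": datetime.now(timezone.utc) - timedelta(hours=30),
    }
    for store, age in rpo_breaches(backups, rpo=timedelta(hours=24)).items():
        print(f"RPO breach: {store} last backed up {age} ago")
```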

7) Red teaming and risk mitigation for AI infrastructure

  • Drive security validation for AI infrastructure and agent deployments:
    • Attack surface review, secrets leakage paths, and egress abuse scenarios
    • Prompt/tool misuse impact assessment at infrastructure level
  • Implement mitigations and hardening measures with measurable controls.

8) Consulting leadership and stakeholder management

  • Act as a trusted technical advisor to customer platform network and security teams.
  • Communicate clearly with diverse stakeholders (CIO/CTO, Security, Infra, and App teams) and drive decisions under ambiguity.
  • Mentor engineers/architects, conduct design reviews, and build reusable delivery accelerators and blueprints.


Responsibilities

Required experience and qualifications

  • 15 years of experience in infrastructure architecture, cloud engineering, or platform consulting, with proven ownership of end-to-end architecture and delivery.
  • Strong fundamentals in networking, operating systems, distributed systems, and enterprise security.
  • Proven experience delivering secure, highly available platforms in regulated or enterprise environments.
  • Deep hands-on experience with:
    • Cloud infrastructure (OCI preferred; AWS/Azure/GCP acceptable)
    • Enterprise network design (VPC/VCN, VPNs, routing, firewalls, proxies, private endpoints, DNS)
    • Kubernetes/container platforms (OKE/EKS/AKS/GKE), secure cluster patterns, and scaling strategies
    • Infrastructure-as-Code (Terraform strongly preferred) and automation (Python/shell)
    • Observability stacks (logs/metrics/traces) and integration with enterprise monitoring tools
    • IAM, vault/key management, secrets handling, encryption standards, and audit controls
  • Strong customer-facing skills: requirements discovery, architecture documentation, and delivery leadership.

Preferred (nice-to-have) skills

  • LLM inference serving (open models and/or managed endpoints), multi-model routing, and AI workload isolation.
  • GPU platform engineering: scheduling, node pool design, performance tuning, and cost controls.
  • Experience implementing agentic AI runtime patterns with safe tool execution and enterprise guardrails.
  • Hybrid and multi-cloud deployments, including on-prem connectivity and enterprise integration patterns.
  • Familiarity with data platforms relevant to AI (vector stores, metadata stores, object storage patterns).

Core competencies (what we value)

  • Systems thinking and security-first architecture mindset
  • Strong problem solving in constrained enterprise environments
  • Crisp documentation and executive-ready communication
  • Hands-on delivery orientation (not just advisory)
  • Ownership, urgency, and accountability for production outcomes

Scope and impact (IC4 expectations)

  • Independently leads complex customer AI infrastructure programs from discovery through production and handover.
  • Unblocks security/network constraints and drives approvals with clear evidence and mitigations.
  • Establishes reusable, referenceable blueprints (secure AI landing zones, LLM hosting patterns, DR templates, observability baselines).
  • Raises the quality bar by mentoring teams and institutionalizing guardrails, reliability practices, and delivery accelerators.


Qualifications

Career Level - IC4




Required Experience:

Staff IC


Key Skills

  • Ruby
  • Disaster Recovery
  • Active Directory
  • SOA
  • Cloud
  • IaaS
  • PowerShell
  • AWS
  • Infrastructure
  • Linux
  • VPN
  • Hyper-V
  • VM
  • IP
  • Identity

About Company


As a world leader in cloud solutions, Oracle uses tomorrow's technology to tackle today's challenges. We've partnered with industry leaders in almost every sector, and continue to thrive after 40+ years of change by operating with integrity. We know that true innovation starts when everyone is empowered to contribute.
