Description
What you will do (Key responsibilities)
1) Architect and deliver customer AI infrastructure (end-to-end)
- Lead architecture and implementation for secure, scalable AI/ML/LLM platforms based on customer requirements and constraints.
- Produce implementation-ready artifacts: HLD/LLD, reference architectures, network/topology diagrams, deployment plans, runbooks, and operational handover packs.
- Translate business and technical requirements into a scalable target state and guide delivery teams through build, rollout, and production readiness.
2) Solve real enterprise constraints (network, access, topology)
- Design enterprise network topologies with segmentation/isolation: private subnets, route tables, security policies, egress control, private endpoints, and controlled ingress patterns.
- Work within common enterprise constraints:
  - Fixed network address plans (pre-approved CIDR ranges), IP allowlists/deny-lists, and limited routing flexibility
  - Private connectivity requirements (VPN/Direct Connect/FastConnect/ExpressRoute), no public endpoints, and restricted DNS resolution
  - Controlled administrative access (bastion/jump host, privileged access management, session recording, time-bound access)
  - Restricted egress (proxy-only outbound, firewall-controlled destinations, egress allowlists, DNS filtering, no direct internet)
  - Customer-managed encryption and key custody (KMS/HSM, BYOK/HYOK, key rotation, certificate lifecycle)
  - Strict TLS policies (mTLS, approved ciphers, enterprise PKI, certificate pinning where required)
  - Identity and access controls (SSO/SAML/OIDC, RBAC/ABAC, least privilege, break-glass accounts)
  - Data governance constraints (PII/PHI handling, residency/sovereignty, retention, audit evidence requirements)
  - Secure software supply chain (approved base images, artifact signing, SBOM, vulnerability scanning, patch SLAs)
  - Endpoint controls (EDR agents, OS hardening standards, restricted packages, golden images)
  - Change management gates (CAB approvals, limited maintenance windows, separation of duties)
  - Observability restrictions (logs can't leave the tenant, redaction/masking, approved collectors/forwarders only)
  - Multi-tenant isolation and policy boundaries (namespace isolation, network policies, runtime sandboxing)
  - High availability and DR expectations (multi-zone patterns, backup/restore, failover runbooks, RTO/RPO)
- Ensure secure data movement and integration patterns for AI workloads (east-west and north-south traffic).
3) Security-by-design, InfoSec approvals, and guardrails for AI platforms
- Lead InfoSec engagement: threat modeling, control mapping, evidence collection, remediation plans, and security sign-offs for AI infrastructure.
- Implement security controls and platform guardrails:
  - TLS/SSL-only communication patterns; encryption-in-transit and encryption-at-rest
  - API security: OAuth2/JWT/mTLS, gateway policies, request signing patterns where required
  - Secrets management using vault/key management services, with rotation and lifecycle controls
  - IAM and least-privilege access models; tenant/project isolation
  - VM hardening (CIS-aligned baselines), patching strategy, secure images
  - Kill switches / emergency stop mechanisms for agents (tool-disable, egress cut-off, policy stop, rollback runbooks)
  - AI infra guardrails: controlled tool execution, outbound allowlists, boundary policies, audit-ready logging
4) LLM hosting, GPU infrastructure, and scale
- Architect LLM hosting patterns: managed endpoints, self-hosted inference, multi-model routing, and workload isolation.
- Design and operationalize GPU-based inference at scale:
  - Capacity planning, GPU node pools, scaling policies, cost/performance optimization
  - Performance profiling and reliability patterns for inference services
- Build container/Kubernetes-based AI platforms (OKE/EKS/AKS/GKE as applicable):
  - Secure cluster designs, namespaces/tenancy, node isolation, secrets, and safe rollout strategies
  - Support AI frameworks and application runtimes on Kubernetes for scale and portability
5) Observability, reliability engineering, and operational readiness
- Define and implement observability across AI systems:
  - Metrics, logs, traces, audit trails, and network call tracing
  - Integration with enterprise observability tools (customer standard platforms)
- Define SLIs/SLOs for AI services:
  - Latency, throughput, error rates, saturation, GPU utilization, queue depth, retry behavior
- Execute load testing and capacity validation for inference endpoints, vector stores, agent runtimes, and integration services.
- Build reliable ops workflows: incident response runbooks, dashboards, alerting, and proactive health checks.
6) Disaster recovery and resilience for AI platforms
- Design DR strategies for AI solutions:
  - Multi-AD / multi-region patterns, backup/restore for critical stores, IaC-based rebuilds
  - Failover runbooks, RTO/RPO alignment, and validation exercises
- Ensure production-grade resilience and safe rollback for platform and application layers.
7) Red teaming and risk mitigation for AI infrastructure
- Drive security validation for AI infrastructure and agent deployments:
  - Attack surface review, secrets leakage paths, egress abuse scenarios
  - Prompt/tool misuse impact assessment at infrastructure level
- Implement mitigations and hardening measures with measurable controls.
8) Consulting leadership and stakeholder management
- Act as a trusted technical advisor to customer platform, network, and security teams.
- Communicate clearly with diverse stakeholders (CIO/CTO, Security, Infra, App teams) and drive decisions under ambiguity.
- Mentor engineers/architects, conduct design reviews, and build reusable delivery accelerators and blueprints.
Responsibilities
Required experience and qualifications
- 15 years of experience in infrastructure architecture, cloud engineering, or platform consulting, with proven ownership of end-to-end architecture and delivery.
- Strong fundamentals in networking, operating systems, distributed systems, and enterprise security.
- Proven experience delivering secure, highly available platforms in regulated or enterprise environments.
- Deep hands-on experience with:
  - Cloud infrastructure (OCI preferred; AWS/Azure/GCP acceptable)
  - Enterprise network design (VPC/VCN, VPNs, routing, firewalls, proxies, private endpoints, DNS)
  - Kubernetes/container platforms (OKE/EKS/AKS/GKE), secure cluster patterns, and scaling strategies
  - Infrastructure-as-Code (Terraform strongly preferred) and automation (Python/shell)
  - Observability stacks (logs/metrics/traces) and integration with enterprise monitoring tools
  - IAM, vault/key management, secrets handling, encryption standards, and audit controls
- Strong customer-facing skills: requirements discovery, architecture documentation, and delivery leadership.
Preferred (nice-to-have) skills
- LLM inference serving (open models and/or managed endpoints), multi-model routing, and AI workload isolation.
- GPU platform engineering: scheduling, node pool design, performance tuning, and cost controls.
- Experience implementing agentic AI runtime patterns with safe tool execution and enterprise guardrails.
- Hybrid and multi-cloud deployments, including on-prem connectivity and enterprise integration patterns.
- Familiarity with data platforms relevant to AI (vector stores, metadata stores, object storage patterns).
Core competencies (what we value)
- Systems thinking and security-first architecture mindset
- Strong problem solving in constrained enterprise environments
- Crisp documentation and executive-ready communication
- Hands-on delivery orientation (not just advisory)
- Ownership, urgency, and accountability for production outcomes
Scope and impact (IC4 expectations)
- Independently leads complex customer AI infrastructure programs from discovery through production and handover.
- Unblocks security/network constraints and drives approvals with clear evidence and mitigations.
- Establishes reusable, referenceable blueprints (secure AI landing zones, LLM hosting patterns, DR templates, observability baselines).
- Raises the quality bar by mentoring teams and institutionalizing guardrails, reliability practices, and delivery accelerators.
Qualifications
Career Level - IC4