Manager, Platform Engineering (DevOps)

Randstad India


Job Location:

Hyderabad - India

Monthly Salary: Not Disclosed
Posted on: 4 days ago
Vacancies: 1 Vacancy

Job Summary

Key Responsibilities:

1. AI Platform & Cloud Architecture
Own and evolve cloud platform architecture supporting AI, ML, and GenAI workloads across all environments
Design platforms for model training, fine-tuning, high-availability inference, batch and event-driven pipelines, and long-running or agent-based workflows
Ensure platforms are cloud-native, modular, extensible, and aligned with enterprise architecture standards
Enable multi-cloud portability (Azure, AWS, GCP) through abstraction of cloud dependencies
Partner with GenAI & Data Architects to align platform capabilities with RAG pipelines, agent orchestration, and data platform architectures

2. CI/CD & Automation
Design and implement end-to-end CI/CD pipelines for applications, data pipelines, ML models, and GenAI prompts
Standardize environment promotion with automated testing, approvals, rollback, and release controls
Integrate pipelines with source control, artifact repositories, model registries, and prompt repositories
Implement progressive delivery patterns such as blue-green deployments, canary releases, and feature flags
Embed security scans, quality gates, and compliance checks directly into CI/CD workflows

3. Infrastructure as Code & Environment Standardization
Define and enforce Infrastructure-as-Code standards using Terraform, ARM/Bicep, and cloud SDKs
Automate provisioning of compute, storage, networking, Kubernetes clusters, and AI platform services
Ensure environments are reproducible, version-controlled, auditable, and free from configuration drift

4. Observability, Reliability & SRE Practices
Design and implement end-to-end observability, including metrics, logs, and distributed tracing
Define and monitor SLIs and SLOs for AI, data, and platform services
Design for high availability, fault tolerance, and disaster recovery
Lead incident response, root-cause analysis, and post-incident reviews
Drive continuous reliability improvements using operational metrics

5. Cost Management & FinOps
Implement FinOps practices for AI and data platforms
Track and optimize infrastructure usage, cost per inference, and GenAI token consumption
Establish cost guardrails, including budgets, alerts, auto-scaling, and shutdown policies
Partner with architects and business stakeholders to balance accuracy, latency, scale, and cost

6. Security, Governance & Compliance
Embed security-by-design into platform architecture and delivery pipelines
Implement IAM, secrets management, encryption, network segmentation, and secure connectivity
Enable audit logging, traceability, and governance for model execution, prompt usage, and data access
Support internal and external audits, penetration testing, and compliance reviews

7. MLOps / LLMOps Enablement
Enable and operate MLOps and LLMOps platforms covering training, serving, monitoring, versioning, and rollback
Support automated evaluation, retraining, drift detection, and performance degradation alerts
Ensure platforms support experimentation without compromising production stability

8. Collaboration & Leadership
Collaborate with GenAI & Data Architects, AI Engineers, Backend and Frontend Engineers, Security, QA, and Delivery teams
Participate in Agile ceremonies, release planning, and roadmap discussions
Provide technical leadership and mentoring to DevOps and platform engineers
Define platform standards, documentation, and best practices
Act as a trusted advisor to leadership on scalability, risk, and cost

Qualifications

Bachelor's degree in Computer Science, Engineering, or a related discipline
Master's degree preferred
Relevant certifications strongly desired (Azure/AWS/GCP Architect or DevOps, Kubernetes, Terraform)
8-12 years of experience in DevOps, Platform Engineering, Cloud Infrastructure, or SRE roles
Proven experience designing and operating enterprise-scale production platforms
Hands-on experience supporting AI/ML and GenAI workloads in regulated or security-conscious environments
Deep expertise in at least one major cloud platform (Azure, AWS, or GCP)
Strong experience with CI/CD, Infrastructure as Code, Kubernetes, and containerized workloads
Proven experience implementing observability, reliability engineering, and incident management practices
Strong understanding of cloud security, governance, and compliance requirements
Hands-on experience with cloud cost optimization and FinOps practices
Proven ability to lead and mentor platform teams and communicate effectively with executive stakeholders
