Cloud Infrastructure Operations Leader

Vidpro Consultancy Services

Posted on : 16-07-2025

Employer Active

1 Vacancy

Job Location

Noida - India

Monthly Salary

Not Disclosed

Job Description

1. Role Overview

The Operations & Service Support Manager ensures 24/7 operational excellence and customer satisfaction for our cloud infrastructure offerings, including GPU-accelerated compute solutions. This role oversees day-to-day operations, manages support teams (Tier 2/3), and collaborates closely with product and engineering teams to maintain high availability, performance, and robust service for enterprise customers running AI, HPC, or other mission-critical workloads.

2. Key Responsibilities

1. Operational Oversight & Service Management

o Lead and coordinate daily operations in multicloud or hybrid environments (e.g. AWS, Azure, GCP, on-prem HPC).

o Maintain operational dashboards (uptime, ticket volumes, SLAs) and proactively address performance or capacity bottlenecks.

o Ensure adherence to ITIL or other standard frameworks for incident, change, and problem management.

2. Team Leadership & Support Structure

o Manage support tiers (L2/L3) and operations staff (NOC, monitoring specialists) to handle escalations, incident triage, and root-cause analysis.

o Set clear KPIs and SOPs for the team, focusing on quick resolution times, high first-contact resolution rates, and continuous improvement.

o Coordinate training, runbooks, and knowledge transfer to ensure each tier has the expertise needed for AI/GPU workloads.

3. Incident & Problem Management

o Oversee major incidents and ensure timely resolution of critical outages or severe performance degradations, especially in GPU-based clusters.

o Chair regular post-incident reviews (RCAs), track corrective actions, and drive improvements to reduce recurrence.

o Maintain strong collaboration with product and SRE/engineering teams to address underlying code or architectural issues.

4. Service Assurance & Continuous Improvement

o Proactively monitor metrics and logs (e.g. GPU utilization, HPC job performance, cost anomalies) to spot potential issues before they escalate.

o Drive automation initiatives (in partnership with DevOps or SRE) to reduce manual toil, improve deployment flows, and streamline maintenance tasks.

o Champion reliability best practices and risk-mitigation strategies aligned with organizational SLAs and error budgets.

5. Stakeholder & Customer Engagement

o Act as a liaison between support/ops teams and key customers, ensuring visibility into operational performance and planned maintenance windows.

o Support customer success teams by providing insights on usage trends, capacity needs, and support ticket data.

o Escalate customer concerns and feedback to the product roadmap when recurring patterns emerge.

6. Resource & Vendor Management

o Manage relationships with external vendors and partners (e.g. GPU hardware providers, colocation/DC hosts, cloud service providers).

o Ensure optimal resource allocation, whether GPU nodes, high-speed storage, or other HPC components, to meet service demands and cost targets.

o Track operational budgets, negotiate contracts, and control OPEX/CAPEX in alignment with company goals.

7. Compliance & Security

o Implement and enforce security policies (access controls, patching, vulnerability management) for HPC/GPU clusters and cloud environments.

o Work with InfoSec teams to maintain compliance (SOC 2, ISO 27001, etc.) and manage data governance or audit requirements.

3. Qualifications & Skills

1. Education & Experience

o Bachelor's or Master's in Computer Science, Engineering, or a related field.

o 8 years in operations/support management roles, with 3 years in cloud infrastructure or HPC/AI environments.

2. Technical & Domain Expertise

o Strong understanding of cloud computing concepts (IaaS, PaaS, containers, virtualization) and GPU-accelerated computing (NVIDIA GPUs, HPC schedulers).

o Familiarity with infrastructure automation (Terraform, Ansible) and observability tools (Prometheus, Grafana, Datadog, etc.).

o Knowledge of distributed systems, HPC clusters, performance tuning, and relevant DevOps/SRE practices.

3. Operations Management

o Proven track record implementing ITIL/ITSM frameworks for incident, change, and problem management at scale.

o Experience running 24/7 support teams, establishing SLAs, and delivering on operational KPIs.

4. Leadership & Communication

o Excellent people-management and coaching skills; able to motivate diverse teams across geographies and time zones.

o Strong communication skills, capable of engaging both senior executive stakeholders and front-line support engineers.

o Adept at crisis management; calm under pressure and systematic in escalation protocols.

Key Skills

team leadership, operations management, communication skills, operations, infrastructure, performance tuning, crisis management, GPU-accelerated computing, cloud computing, observability tools, cloud, infrastructure automation, ITIL, multicloud environments, incident management

Employment Type

Full Time

About Company

Vidpro Consultancy Services
