1. Role Overview
The Operations & Service Support Manager ensures 24/7 operational excellence and customer satisfaction for our cloud infrastructure offerings, including GPU-accelerated compute solutions. This role oversees day-to-day operations, manages support teams (Tier 2/3), and collaborates closely with product and engineering teams to maintain high availability, performance, and robust service for enterprise customers running AI, HPC, or other mission-critical workloads.
2. Key Responsibilities
1. Operational Oversight & Service Management
o Lead and coordinate daily operations in multicloud or hybrid environments (e.g., AWS, Azure, GCP, on-prem HPC).
o Maintain operational dashboards (uptime, ticket volumes, SLAs) and proactively address performance or capacity bottlenecks.
o Ensure adherence to ITIL or other standard frameworks for incident, change, and problem management.
2. Team Leadership & Support Structure
o Manage support tiers (L2/L3) and operations staff (NOC, monitoring specialists) to handle escalations, incident triage, and root-cause analysis.
o Set clear KPIs and SOPs for the team, focusing on quick resolution times, high first-contact resolution rates, and continuous improvement.
o Coordinate training, runbooks, and knowledge transfer to ensure each tier has the expertise needed for AI/GPU workloads.
3. Incident & Problem Management
o Oversee major incidents and ensure timely resolution of critical outages or severe performance degradations, especially in GPU-based clusters.
o Chair regular post-incident reviews (RCAs), track corrective actions, and drive improvements to reduce recurrence.
o Maintain strong collaboration with product and SRE/engineering teams to address underlying code or architectural issues.
4. Service Assurance & Continuous Improvement
o Proactively monitor metrics and logs (e.g., GPU utilization, HPC job performance, cost anomalies) to spot potential issues before they escalate.
o Drive automation initiatives (in partnership with DevOps or SRE) to reduce manual toil, improve deployment flows, and streamline maintenance tasks.
o Champion reliability best practices and risk-mitigation strategies aligned with organizational SLAs and error budgets.
5. Stakeholder & Customer Engagement
o Act as a liaison between support/ops teams and key customers, ensuring visibility into operational performance and planned maintenance windows.
o Support customer success teams by providing insights on usage trends, capacity needs, and support ticket data.
o Escalate customer concerns and feedback to the product team for roadmap consideration when recurring patterns emerge.
6. Resource & Vendor Management
o Manage relationships with external vendors and partners (e.g., GPU hardware providers, colocation/DC hosts, cloud service providers).
o Ensure optimal resource allocation, whether GPU nodes, high-speed storage, or other HPC components, to meet service demands and cost targets.
o Track operational budgets, negotiate contracts, and control OPEX/CAPEX in alignment with company goals.
7. Compliance & Security
o Implement and enforce security policies (access controls, patching, vulnerability management) for HPC/GPU clusters and cloud environments.
o Work with InfoSec teams to maintain compliance (SOC 2, ISO 27001, etc.) and manage data governance or audit requirements.
3. Qualifications & Skills
1. Education & Experience
o Bachelor's or Master's in Computer Science, Engineering, or a related field.
o 8 years in operations/support management roles, with 3 years in cloud infrastructure or HPC/AI environments.
2. Technical & Domain Expertise
o Strong understanding of cloud computing concepts (IaaS, PaaS, containers, virtualization) and GPU-accelerated computing (NVIDIA GPUs, HPC schedulers).
o Familiarity with infrastructure automation (Terraform, Ansible) and observability tools (Prometheus, Grafana, Datadog, etc.).
o Knowledge of distributed systems, HPC clusters, performance tuning, and relevant DevOps/SRE practices.
3. Operations Management
o Proven track record implementing ITIL/ITSM frameworks for incident, change, and problem management at scale.
o Experience running 24/7 support teams, establishing SLAs, and delivering on operational KPIs.
4. Leadership & Communication
o Excellent people-management and coaching skills, able to motivate diverse teams across geographies and time zones.
o Strong communication skills, capable of engaging both senior executive stakeholders and frontline support engineers.
o Adept at crisis management; calm under pressure and systematic in escalation protocols.
team leadership, operations management, communication skills, operations, infrastructure, performance tuning, crisis management, gpu-accelerated computing, cloud computing, observability tools, cloud, infrastructure automation, itil, multicloud environments, incident management
Full Time