Senior Site Reliability Engineer

Randstad India

Job Location:

Bengaluru - India

Monthly Salary: Not Disclosed
Posted on: 4 hours ago
Vacancies: 1 Vacancy

Job Summary

AI-SRE
Job Description:
Oversee the reliability, availability, and performance of enterprise AI platforms, including
Azure Machine Learning, Azure AI Foundry, AWS SageMaker, Microsoft Pilot Studio,
LangSmith, and other cloud or on-premises AI solutions.
Monitor and optimise the reliability of data pipelines supporting AI workloads, proactively
identifying and resolving bottlenecks or failures.
Set up and manage alerting mechanisms for outages, model failures, data pipeline
disruptions, and service degradations across multi-cloud and hybrid environments.
Lead incident response for platform, model, or data pipeline issues, driving rapid
resolution and root cause analysis.
Establish, track, and report on Service Level Agreements (SLAs) and uptime targets for AI
services, collaborating with platform owners and business stakeholders.
Implement and optimise redundancy, auto-scaling, and failover strategies to meet or
exceed SLAs for AI workloads.
Ensure compliance with security, privacy, and governance requirements for AI workloads,
including audit logging, access controls, and regulatory monitoring.
Assist teams with onboarding, environment setup, and integration of AI workloads onto
supported platforms.
Act as a support escalation point for complex deployment, infrastructure, and runtime
issues, collaborating with platform vendors as needed.
Contribute to platform documentation, runbooks, and knowledge sharing to drive
operational excellence.
Requirements:
Familiarity with one or more enterprise AI platforms (e.g. Azure ML, Azure AI Foundry,
AWS SageMaker, Microsoft Pilot Studio, LangSmith).
Strong understanding of SLAs, uptime targets, and implementation of redundancy and
scaling policies for AI services.
Familiarity with onboarding and supporting teams in production AI environments.
Excellent troubleshooting and problem-solving skills, especially in debugging complex
deployment, data, and runtime issues.
Knowledge of cloud security, compliance, and governance best practices for AI
workloads.