AWS Technical Service Management

Mexico City - Mexico

Monthly Salary: Not Disclosed

Posted on: Yesterday

Vacancies: 1

Job Summary

AWS TECHNICAL SERVICE MANAGEMENT

Country: Mexico

To succeed in this role you will be responsible for:

Own and continuously improve ITIL practices for Incident Management, Change Management, and Problem Management for AWS-based services.
Ensure service stability and adherence to SLAs/OLAs through operational controls, service reviews, and continuous improvement initiatives.
Establish and track service health KPIs (availability, incident volume, MTTR/MTTA, change success rate, problem recurrence).
Incident Management (incl. Major Incidents)
Lead incident triage and coordination across cloud infrastructure, platform, security, and application teams.
Use Dynatrace / CloudWatch insights (alerts, traces, service flow, SLOs) to accelerate identification of impact scope and probable root cause domains (app vs. infra vs. dependencies).
Coordinate communications and status updates during incidents, ensuring timely escalation, stakeholder alignment, and adherence to restoration targets.
Change Management & Governance (CAM / CAB / Committees)
Create, validate, and control change requirements in ServiceNow, ensuring quality of change records (scope, impact, risk, test evidence, implementation plan, backout plan, approvals).
Drive the end-to-end change lifecycle: intake, risk/impact analysis, scheduling, approvals, implementation tracking, post-change validation, and closure.
Prepare and present changes to CAM, CAB, and other change forums, ensuring compliance with governance and regulatory expectations.
Monitor change calendars/pipelines to prevent conflicts and reduce change-related incidents.
Problem Management & Continuous Improvement
Lead or coordinate problem investigations for recurring incidents; ensure strong root cause analysis (RCA) and corrective/preventive action plans (CAPA).
Track action items to closure and measure effectiveness (e.g., recurrence reduction, improved SLO attainment).
Monitoring, Metrics & Reporting (ServiceNow, Dynatrace)
Analyze and interpret data from ServiceNow (tickets, categories, backlog, SLA breaches) and Dynatrace (availability/performance indicators) to detect deviations, risks, and trends.
Produce weekly/monthly operational reports and dashboards: SLA compliance, incident trends, change success rate/failure modes, top recurring issues, operational risk indicators.
Propose mitigation plans and service improvements based on evidence and measurable outcomes.
Process Documentation and Automation Enablement
Define and maintain operational processes and standards for cloud service operations.
Identify opportunities for systems automation (auto-remediation, workflow automation, alert tuning) and partner with engineering teams to implement them.
Stakeholder Management & Cross-Team Coordination
Act as the operational focal point between cloud teams, application owners, security/risk, and governance stakeholders.
Support decision-making by providing clear risk assessments, impact narratives, and recommended actions.
Negotiate priorities, timelines, maintenance windows, and resource needs across teams.
