AWS TECHNICAL SERVICE MANAGEMENT
Country: Mexico
In this role, you will:
- Own and continuously improve ITIL practices for Incident Management, Change Management, and Problem Management for AWS-based services.
- Ensure service stability and adherence to SLAs/OLAs through operational controls, service reviews, and continuous improvement initiatives.
- Establish and track service health KPIs (availability, incident volume, MTTR/MTTA, change success rate, problem recurrence).
- Incident Management (incl. Major Incidents)
- Lead incident triage and coordination across cloud infrastructure, platform, security, and application teams.
- Use Dynatrace / CloudWatch insights (alerts, traces, service flow, SLOs) to accelerate identification of impact scope and probable root cause domains (app vs. infra vs. dependencies).
- Coordinate communications and status updates during incidents, ensuring timely escalation, stakeholder alignment, and restoration targets.
- Change Management & Governance (CAM / CAB / Committees)
- Create, validate, and control change requirements in ServiceNow, ensuring quality of change records (scope, impact, risk, test evidence, implementation plan, backout plan, approvals).
- Drive the end-to-end change lifecycle: intake, risk/impact analysis, scheduling, approvals, implementation tracking, post-change validation, and closure.
- Prepare and present changes to CAM, CAB, and other change forums, ensuring compliance with governance and regulatory expectations.
- Monitor change calendars/pipelines to prevent conflicts and reduce change-related incidents.
- Problem Management & Continuous Improvement
- Lead or coordinate problem investigations for recurring incidents; ensure strong root cause analysis (RCA) and corrective/preventive action plans (CAPA).
- Track action items to closure and measure effectiveness (e.g., recurrence reduction, improved SLO attainment).
- Monitoring, Metrics & Reporting (ServiceNow, Dynatrace)
- Analyze and interpret data from ServiceNow (tickets, categories, backlog, SLA breaches) and Dynatrace (availability/performance indicators) to detect deviations, risks, and trends.
- Produce weekly/monthly operational reports and dashboards: SLA compliance, incident trends, change success rate/failure modes, top recurring issues, operational risk indicators.
- Propose mitigation plans and service improvements based on evidence and measurable outcomes.
- Process Documentation and Automation Enablement
- Define and maintain operational processes and standards for cloud service operations.
- Identify opportunities for systems automation (auto-remediation, workflow automation, alert tuning) and partner with engineering teams to implement them.
- Stakeholder Management & Cross-Team Coordination
- Act as the operational focal point between cloud teams, application owners, security/risk, and governance stakeholders.
- Support decision-making by providing clear risk assessments, impact narratives, and recommended actions.
- Negotiate priorities, timelines, maintenance windows, and resource needs across teams.