SRE Lead – Observability
Job Summary
Job Title: Senior Platform Engineer / Senior SRE Developer Observability (Dynatrace)
Location: Toronto, ON
Work Style: Hybrid (2 days per week in-person at Toronto office preferred)
Skills: Digital: Python; Digital: Platform System (APS)
Experience Required: 8-10 years
Role Descriptions:
SRE Lead
Deep application- and system-level knowledge across complex end-to-end environments, including tightly integrated on-prem and cloud-native services supporting large-scale, multi-tier transaction flows
Prior hands-on experience with APM and observability platforms (Dynatrace or comparable enterprise tools), with the ability to instrument, analyze, and troubleshoot complex distributed applications
Proven expertise in deep troubleshooting across multi-layer end-to-end (E2E) environments, including application, infrastructure, network, and platform layers (on-prem and cloud)
Drive and execute the SRE / WCCS roadmap for BMO
Hands-on role from Day 1
Strong observability experience (refer to Observability SME expectations below)
Deep knowledge and experience implementing SRE practices and guiding complex SRE transformations across the industry
Key Contributions:
Assess current SRE capabilities, identify gaps, and contribute to the SRE & WCCS roadmap
Navigate and collaborate across multi-team SRE and IT Operations environments to drive results
Deliver creative workarounds and practical solutions to complex problems
SRE Observability SME
Hands-on role from Day 1
Strong Day 1 Dynatrace expertise, including:
o DQL
o Gen3 Dashboards
o Traces / Grail
o ActiveGate and Plugins
o SRG / Workflow development
o BizEvents
Prior hands-on experience with APM and observability platforms (Dynatrace or equivalent), with the ability to instrument, analyze, and troubleshoot distributed applications
Deep troubleshooting expertise using observability signals (Metrics, Events, Logs, Traces) to identify root causes across complex multi-layer E2E environments
Strong foundation in Observability fundamentals (MELT)
Expert-level dashboard design, including UI/UX best practices
Extensive experience troubleshooting performance and non-functional issues
Familiarity with SRE concepts as outlined in the Google SRE book/workbook
Strong expertise in AWS observability, including:
o CloudWatch
o Application Signals
o Metrics, Logs, and Traces
o Lambda and API Gateway
Ability to design creative monitoring solutions for platforms with limited observability (e.g., IBM DataPower)
Development experience with Python, AWS Lambda, ECS, and Azure Functions
Understanding of AI-based system fundamentals, including how such systems are built and monitored
Background or working knowledge of OpenTelemetry (OTEL)
Experience in Financial Services or equivalent highly complex environments (e.g., 50 systems collaborating to fulfill a single customer transaction)
Required Skills:
SailPoint