Job Description: Site Reliability Engineer (SRE) Observability
Toronto - Hybrid (1-2 days office)
Role Summary
We are looking for an Observability Engineer to help implement, operate, and improve observability capabilities across our applications and platforms. This role focuses on hands-on onboarding, instrumentation, dashboarding, and alerting, working under established standards and guidance from senior engineers.
You will collaborate with application, SRE, and operations teams to ensure systems are observable, supportable, and production-ready.
Key Responsibilities
Observability Implementation
- Implement and maintain metrics, logs, and traces for applications and infrastructure
- Assist with onboarding applications into observability platforms (e.g. Dynatrace, ELK, Datadog)
- Configure dashboards, alerts, and basic anomaly detection
Application Support & Instrumentation
- Work with development teams to enable structured logging, basic distributed tracing, and core metrics
- Validate observability requirements during Production Readiness Reviews (PRR)
- Troubleshoot missing or low-quality telemetry
Monitoring & Alerting
- Configure alerts based on golden signals (latency, errors, traffic, saturation)
- Help reduce alert noise by tuning thresholds and alert logic
- Support incident response by gathering logs, metrics, and traces
Operations & Reliability
- Support root cause analysis using observability tools
- Maintain dashboards and documentation used by on-call and support teams
- Participate in on-call rotations (as applicable)
Automation & Continuous Improvement
- Assist in automating observability onboarding and validation tasks
- Create and maintain reusable dashboards and alert templates
- Follow established observability standards and best practices
Required Qualifications
- 2-4 years of experience in Observability or SRE
- Working knowledge of metrics, logs, and basic tracing concepts
- Hands-on experience with at least one observability platform (Dynatrace, Elastic/ELK, Datadog, New Relic, etc.)
- Basic understanding of SLIs/SLOs and service health indicators
- Experience with cloud platforms or hybrid environments
- Ability to write scripts (Python, Bash, PowerShell) for automation and troubleshooting
Preferred Qualifications
- Experience with OpenTelemetry or APM agents
- Familiarity with Kubernetes or containerized workloads
- Experience working with incident management tools (PagerDuty, ServiceNow)
- Exposure to Dynatrace/Kibana, ELK, or similar cloud-native monitoring
- Experience in regulated or enterprise environments
Required Skills:
Experience (Years): 4-6
Essential Skills:
- Work with project teams throughout the organization to design, implement, and manage CDN infrastructure using Akamai to ensure high availability, performance, and scalability for customer-facing applications and business processes.
- Handle multiple priorities and assignments with excellence and precision.
- Be part of a 24/7/365 organization (some after-hours support is expected as part of the normal on-call rotation).
- Directly support line-of-business development teams and provide guidance to them on implementation and changes for customer-facing applications.
- Develop and maintain security protocols and measures to protect CDN infrastructure from cyber threats.
- Monitor and analyze network performance, identifying and resolving issues to optimize content delivery of critical applications.
- Collaborate with cross-functional teams to integrate Akamai CDN solutions with existing systems and applications.
- Collaborate with information security teams to implement DDoS protection strategies and other security measures in the CDN.
- Provide technical support and guidance to clients and internal teams regarding CDN and security best practices.
- Work closely with vendor and professional service teams on delivery-related activities and strategy.
Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field, or similar work experience.
- Strong understanding of network protocols (HTTP/HTTPS, DNS, TCP/IP).
- Proven experience as a CDN Engineer or similar role with a strong focus on in-depth knowledge of Content Delivery Network technologies, including caching, load balancing, and content optimization.
- Excellent problem-solving skills and attention to detail.
- Strong communication and teamwork abilities.
- Experience supporting 24/7/365 customer-facing applications at enterprise scale.
- Awareness of and experience with cybersecurity tools and practices such as firewalls, intrusion detection/prevention systems, and encryption.
- Proficiency in scripting and automation (e.g. Python, Bash) is a plus.
- Relevant certifications (e.g. CISSP, CEH) are a plus but not required.