Why 360training

At 360training, we're more than just a leader in online training; we're helping people unlock their potential and shape their futures. For over two decades, we've empowered millions of learners with regulatory-approved training across industries, making it possible for individuals to get the jobs they want and keep the careers they love. Our success is built on two simple but powerful values: Deliver Results and Do the Right Thing. They're not just words on a wall; they guide how we work, collaborate, and grow together. At 360training, you'll join a passionate team that invests in your development, rewards your results, and supports you personally and professionally. If you're looking for a career where you can make an impact, grow quickly, and be valued every step of the way, this is your chance.

Site Reliability & Observability Engineer

360training is seeking a Site Reliability & Observability Engineer to build and scale our observability and reliability practices across cloud, container, and application environments. This role will be responsible for developing the systems, tools, and processes that ensure application performance, reliability, and visibility across multiple platforms and brands.

The SRE will partner closely with DevOps and Development teams to define service-level objectives (SLOs), establish automated monitoring and alerting, and drive performance optimization across infrastructure and applications. This individual will also play a critical role in incident response, postmortems, and the ongoing evolution toward a data-driven reliability culture.

Our ideal candidate is a hands-on engineer with experience in application performance monitoring (APM), metrics, tracing, and logging, and a strong background in automation and cloud-native observability tooling.

Key Responsibilities

Observability Platform Development
- Design, implement, and manage the enterprise-wide observability stack (APM, metrics, logs, and traces) across Azure and containerized workloads.
- Deploy and maintain monitoring tools to ensure full-stack visibility.
- Build standardized dashboards, alerts, and KPIs for key services and business applications.
- Develop and maintain automation for telemetry data collection, alert configuration, and dashboard provisioning.
- Ensure coverage for application, infrastructure, and end-user experience monitoring across all environments.

Reliability Engineering
- Define and maintain Service-Level Objectives (SLOs), Service-Level Indicators (SLIs), and Error Budgets in partnership with DevOps and Development teams.
- Implement automated incident detection, alerting, and response playbooks to reduce MTTR.
- Analyze recurring incidents and drive permanent fixes and reliability improvements.
- Support the transition toward zero-downtime deployments by validating performance and stability during rollout stages.

Performance & Cost Optimization
- Establish performance baselines and track resource utilization across cloud and container infrastructure.
- Work with DevOps and Development teams to identify performance bottlenecks and recommend optimizations.
- Monitor and optimize monitoring metrics ingestion, Azure Log Analytics, and storage costs to balance visibility with efficiency.

Incident Management & Postmortems
- Serve as a key responder during major incidents, providing data-driven insights and remediation coordination.
- Lead root cause analysis (RCA) and ensure postmortem action items are implemented.
- Build dashboards and analytics to identify leading indicators of failure and performance degradation.
- Improve operational playbooks to accelerate detection and recovery.

Automation & Continuous Improvement
- Contribute to CI/CD pipeline integrations for instrumentation validation and canary monitoring.
- Continuously evaluate emerging observability tools and practices for adoption.
- Advocate for reliability and monitoring best practices across engineering teams.

Required Skills
- 5 years of experience in Site Reliability, Observability, or DevOps Engineering roles.
- Strong hands-on experience with observability tools such as Datadog, New Relic, Grafana, ELK/EFK, or equivalent.
- Deep understanding of metrics, tracing, and logging concepts and their correlation across distributed systems.
- Experience implementing Synthetics and RUM monitoring for frontend performance.
- Experience defining and managing SLOs, SLIs, and Error Budgets.
- Solid grasp of Azure infrastructure, Kubernetes (AKS), and container monitoring.
- Familiarity with CI/CD pipelines and integrating monitoring into deployment workflows.
- Excellent analytical and communication skills; able to translate complex data into actionable insights.

Preferred Skills
- Understanding of distributed tracing in microservice architectures.
- Experience with frontend website performance tuning and optimization based on Core Web Vitals.
- Strong scripting and automation skills (Python, PowerShell, or Bash).
- Experience with incident management and RCA processes.