The Enterprise Observability Specialist is a mid-level position within the Unified Digital Intelligence function's Enterprise Observability team, responsible for maintaining the resilience and performance of critical infrastructure, applications, and websites. This role involves designing and implementing Level 0 automation and Level 1 monitoring solutions, conducting Level 2 advanced troubleshooting, coordinating incident response, and administering, configuring, and deploying observability tools, primarily Dynatrace, along with ThousandEyes, Evolven, and others as required. The specialist supports the event intelligence platform and collaborates with cross-functional teams to monitor on-premises and cloud environments, contributing to the team's 24x7 monitoring operations. Additional responsibilities include maintaining shift operator logs and performing handover duties to ensure seamless team functionality.
Key Roles & Responsibilities
The incumbent, directly or through collaboration, will:
Provide Level 1 support by monitoring alerts and resolving basic issues across critical websites, web applications, and infrastructure, escalating unresolved issues to Level 2 as needed.
Perform Level 2 advanced diagnostics (e.g., log analysis, performance troubleshooting) to resolve complex issues such as misconfigurations and performance bottlenecks.
Design, configure, and administer observability tools, including Dynatrace, ThousandEyes, Evolven, and others, to ensure optimal monitoring capabilities.
Implement Level 0 automation processes (e.g., automated ticket creation and routing) to improve alert response efficiency.
Develop observability processes across infrastructure, production applications, and websites to ensure robust anomaly detection and situational awareness.
Integrate application and infrastructure data with the event intelligence platform and other observability tools to enable automated incident handling.
Lead incident response coordination, leveraging observability data and analytics to accelerate recovery and keep stakeholders informed.
Maintain comprehensive shift operator logs documenting incidents, tool configurations, actions taken, and escalations during assigned shifts.
Conduct clear and thorough handovers to the next shift, ensuring ongoing issues, system status, and pending actions are communicated effectively.
Deliver training sessions for monitoring teams on configuration, administration, and usage of observability tools and the event intelligence platform, supporting Level 1 and Level 2 functions.
Maintain thorough documentation of processes, configurations, and incident responses to foster knowledge sharing and reduce future downtime.
Troubleshoot application and infrastructure performance issues, offering performance tuning recommendations when necessary.
Support the implementation of SLO/SLI metrics in collaboration with support teams and application owners.
Participate in shift-based operations, including weekends and after-hours support, as part of a rotating schedule.
Obtain or maintain a relevant Dynatrace Specialist certification (e.g., Application Performance Monitoring Specialist, Infrastructure Monitoring Specialist) within 6 months of hire.
Preferred educational background includes a Bachelor's or Associate's degree in Technology, Engineering, or a related field.
4 to 7 years of experience in Information Systems with a focus on application and infrastructure monitoring.
Strong expertise in designing and managing observability tools, particularly Dynatrace, with familiarity in ThousandEyes, Evolven, and event intelligence platforms.
Experience in application development technologies, programming languages, and advanced troubleshooting.
Proficiency across multiple operating systems, including Unix, Linux, and Windows.
Hands-on experience with observability and monitoring tools for servers, applications (real and synthetic), and infrastructure (client/server/logs).
Working knowledge of scripting languages such as Perl, Java, or Python.
Experience with event intelligence platforms for automated incident detection and response.
Strong understanding of cloud-native monitoring strategies and technologies.
Demonstrated ability to script and automate using REST APIs and webhooks.
Practical experience monitoring applications running on Kubernetes and OpenShift clusters, including service tracing and health checks.
Solid understanding of cloud infrastructure monitoring tools and environments (e.g., EC2, Lambda, Azure Functions, GKE, EKS).
Strong analytical and troubleshooting skills to perform Level 2 fault isolation and resolution.
Excellent communication skills to support incident coordination and collaboration with cross-functional technical and operational teams.
Effective design and implementation of automation and observability processes, leading to reduced recovery times and improved visibility.
Accurate, detailed shift operator logs and smooth handovers supporting uninterrupted team operations.
Coordinated Level 1 and Level 2 incident response, with effective integration of observability data into relevant platforms, enhancing overall system reliability.
Positive feedback from monitoring teams regarding training sessions and tool adoption, along with demonstrated progress toward Dynatrace certification.
Full Time