Site Reliability Engineer (SRE) with expertise in Dynatrace monitoring, log investigation, and observability practices. The ideal candidate will have a deep understanding of business processes and upstream and downstream dependencies, and the ability to design and implement dashboards with SLOs and SLAs that align with business objectives.
Key Responsibilities
Monitoring and Observability: Configure and maintain Dynatrace for application and infrastructure monitoring. Develop custom dashboards, alerts, and reports to track system health and performance. Define and implement Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
Log Analysis and Troubleshooting: Perform log investigation using tools like Splunk, ELK, or similar platforms. Identify root causes of incidents and provide actionable insights for resolution.
Business Understanding: Analyze business models, workflows, and critical application flows. Map upstream and downstream dependencies to ensure end-to-end reliability.
Incident Management: Participate in on-call rotations and respond to production incidents. Drive post-incident reviews and implement preventive measures.
Automation and Optimization: Automate monitoring and alerting processes to reduce manual intervention. Collaborate with development teams to improve system reliability and performance.
Required Skills and Qualifications
Technical Expertise: Strong experience with Dynatrace (configuration, dashboards, problem detection). Proficiency in log analysis tools (Splunk, ELK, or equivalent). Solid understanding of SRE principles, observability, and incident management.
Business and Analytical Skills: Ability to understand business processes and translate them into technical monitoring solutions. Experience in mapping application dependencies and performing impact analysis.
Soft Skills: Excellent communication and collaboration skills. Strong problem-solving skills and an analytical mindset.
Preferred: Experience with cloud platforms (AWS, Azure, GCP). Familiarity with CI/CD pipelines and automation scripting.
Performance Metrics: Uptime and reliability improvements. Reduction in Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR). Accuracy and relevance of dashboards and alerts. Compliance with defined SLOs and SLAs.
Experience required: 10 years