Desirable Skills:
Site Reliability Engineering (SRE)
Amazon Web Services (AWS) Cloud Computing
Role Descriptions:
Design, implement, and maintain highly available and scalable systems on AWS.
Develop and manage CI/CD pipelines for automated deployments and testing.
Configure and optimize Dynatrace monitoring for application performance and infrastructure health.
Implement observability practices (metrics, logging, tracing) to improve system reliability.
Collaborate with development and operations teams to automate processes and reduce manual interventions.
Perform incident management and root cause analysis, and drive continuous improvement.
Ensure security, compliance, and cost optimization in cloud environments.
Required Skills and Qualifications:
Strong experience with AWS services (EC2, S3, RDS, Lambda, VPC, IAM, CloudWatch).
Hands-on expertise in Dynatrace for application and infrastructure monitoring.
Proficiency in CI/CD tools (Jenkins, GitLab CI, Azure DevOps, or similar).
Knowledge of Infrastructure as Code (IaC) tools (Terraform, AWS CloudFormation).
Experience with containerization and orchestration (Docker, Kubernetes).
Familiarity with scripting languages (Python and Bash).
Solid understanding of SRE principles, SLIs, SLOs, and error budgets.
Required Skills:
Essential Skills:
Site Reliability Engineer (SRE)
Amazon Web Services (AWS) Cloud Computing
GitHub Enterprise
Role Description:
Site Reliability Engineer (SRE) with expertise in Dynatrace monitoring, log investigation, and observability practices. The ideal candidate will have a deep understanding of business processes and upstream-downstream dependencies, and the ability to design and implement dashboards with SLOs and SLAs that align with business objectives.
Key Responsibilities:
Monitoring and Observability:
Configure and maintain Dynatrace for application and infrastructure monitoring.
Develop custom dashboards, alerts, and reports to track system health and performance.
Define and implement Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
Log Analysis and Troubleshooting:
Perform log investigation using tools like Splunk, ELK, or similar platforms.
Identify root causes of incidents and provide actionable insights for resolution.
Business Understanding:
Analyze business models, workflows, and critical application flows.
Map upstream and downstream dependencies to ensure end-to-end reliability.
Incident Management:
Participate in on-call rotations and respond to production incidents.
Drive post-incident reviews and implement preventive measures.
Automation and Optimization:
Automate monitoring and alerting processes to reduce manual intervention.
Collaborate with development teams to improve system reliability and performance.
Required Skills and Qualifications:
Technical Expertise:
Strong experience with Dynatrace (configuration, dashboards, and problem detection).
Proficiency in log analysis tools (Splunk, ELK, or equivalent).
Solid understanding of SRE principles, observability, and incident management.
Business and Analytical Skills:
Ability to understand business processes and translate them into technical monitoring solutions.
Experience in mapping application dependencies and creating impact analyses.
Soft Skills:
Excellent communication and collaboration skills.
Strong problem-solving and analytical mindset.
Preferred Experience:
Experience with cloud platforms (AWS, Azure, GCP).
Familiarity with CI/CD pipelines and automation scripting.
Performance Metrics:
Uptime and reliability improvements.
Reduction in Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
Accuracy and relevance of dashboards and alerts.
Compliance with defined SLOs and SLAs.
Experience Required: 8-10 years
IT Services and IT Consulting