We are looking for an experienced Site Reliability Engineer (SRE) with strong expertise in Google Cloud Platform (GCP), Kubernetes, and Dynatrace. The ideal candidate will have hands-on experience in support projects, proactive monitoring, root cause analysis (RCA), and client communications. This role requires ownership of incident management, dashboard creation, and alerting mechanisms, as well as collaboration with cross-functional teams and external vendors.
Key Responsibilities
- Manage and support production environments on GCP and Kubernetes clusters.
- Monitor all critical dashboards and ensure timely alerting for production issues.
- Conduct Root Cause Analysis (RCA) for all incidents and production issues.
- Assist and guide the team in debugging and resolving complex production problems.
- Lead daily standups and client calls, and communicate status updates effectively.
- Track, update, and manage JIRA tickets related to support and incident management.
- Represent the SRE team in all client interactions, maintaining deep knowledge of ongoing tickets and issues.
- Create and maintain alerts and dashboards for monitoring new and existing features.
- Support onsite teams by providing insights and data from various monitoring tools.
- Ensure compliance with Standard Operating Procedures (SOPs) related to alerts and incident handling.
- Coordinate with external vendors in case of integration failures or outages.
- Measure and analyze front-end performance metrics using relevant tools.
- Advocate for and enforce best practices for site reliability, monitoring, and incident response.
Required Skills & Experience
- Proven experience working on support projects in a production environment.
- Strong hands-on knowledge of Google Cloud Platform (GCP).
- Expertise in Kubernetes cluster management and troubleshooting.
- Proficiency with Dynatrace monitoring and alerting tools.
- Familiarity with log monitoring tools such as Splunk or Sumo Logic (preferred).
- Excellent problem-solving and root cause analysis skills.
- Experience managing incidents and maintaining dashboards.
- Strong communication skills to handle client interactions and team coordination.
- Ability to work collaboratively in Agile and DevOps environments.
Preferred Qualifications
- Experience with front-end performance monitoring tools.
- Prior exposure to multi-vendor integration support.
- Understanding of SOPs related to incident and alert management.