Apptad - SREOps L1/L2 Lead

Employer Active

1 Vacancy

Job Location

Chicago, IL - USA

Monthly Salary

Not Disclosed

Job Description

We are looking for an experienced Site Reliability Engineer (SRE) with strong expertise in Google Cloud Platform (GCP), Kubernetes, and Dynatrace. The ideal candidate will have hands-on experience in support projects, proactive monitoring, root cause analysis (RCA), and client communications. This role requires ownership of incident management, dashboard creation, and alerting mechanisms, as well as collaboration with cross-functional teams and external vendors.

Key Responsibilities

  • Manage and support production environments on GCP and Kubernetes clusters.
  • Monitor all critical dashboards and ensure timely alerting for production issues.
  • Conduct Root Cause Analysis (RCA) for all incidents and production issues.
  • Assist and guide the team in debugging and resolving complex production problems.
  • Lead daily standups and client calls, and communicate status updates effectively.
  • Track, update, and manage JIRA tickets related to support and incident management.
  • Represent the SRE team in all client interactions, maintaining deep knowledge of ongoing tickets and issues.
  • Create and maintain alerts and dashboards for monitoring new and existing features.
  • Support onsite teams by providing insights and data from various monitoring tools.
  • Ensure compliance with Standard Operating Procedures (SOPs) related to alerts and incident handling.
  • Coordinate with external vendors in case of integration failures or outages.
  • Measure and analyze front-end performance metrics using relevant tools.
  • Advocate and enforce best practices for site reliability, monitoring, and incident response.

Required Skills & Experience

  • Proven experience working on support projects in a production environment.
  • Strong hands-on knowledge of Google Cloud Platform (GCP).
  • Expertise in Kubernetes cluster management and troubleshooting.
  • Proficient with Dynatrace monitoring and alerting tools.
  • Familiarity with log monitoring tools such as Splunk or Sumo Logic (preferred).
  • Excellent problem-solving and root cause analysis skills.
  • Experience managing incidents and maintaining dashboards.
  • Strong communication skills to handle client interactions and team coordination.
  • Ability to work collaboratively in Agile and DevOps environments.

Preferred Qualifications

  • Experience with front-end performance monitoring tools.
  • Prior exposure to multi-vendor integration support.
  • Understanding of SOPs related to incident and alert management.

Employment Type

Full-time
