Sr. Site Reliability Engineer

Phoenix, AZ - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Site Reliability Engineer (SRE)

Location: Phoenix, AZ

We are seeking a Site Reliability Engineer (SRE) to join the Cloud Operations and Observability team. You'll be instrumental in driving resiliency, performance, automation, and AI-driven observability across hybrid cloud environments (Azure and GCP). You will design, implement, and manage infrastructure with a strong focus on Kubernetes and on integrating AI/LLM solutions into observability and operational workflows.

Key Responsibilities:

Build and operate scalable, secure, and highly available infrastructure in Azure and GCP.
Design and maintain observability platforms leveraging Splunk, OpenTelemetry, and cloud-native monitoring tools.
Develop and support AI/LLM-driven automation solutions to improve incident triage, alert correlation, and root cause analysis.
Partner with application and data teams to define SLOs, SLIs, and error budgets.
Drive operational excellence through automation, chaos testing, and proactive reliability improvements.
Optimize Kubernetes environments (GKE/AKS) for performance, security, and cost-efficiency.
Integrate observability data pipelines with LLMs for anomaly detection, summarization, and proactive remediation.
Participate in on-call rotations, incident response, and postmortem reviews.
Implement runbooks, auto-remediation scripts, and AI copilots for operations.

Required Qualifications:

8 years of experience as an SRE.
Strong expertise in Azure and GCP cloud platforms (certifications a plus).
Proficient in Splunk (Enterprise, Observability) for monitoring, alerting, and log analytics.
In-depth knowledge of Kubernetes (AKS, GKE), Helm, and container lifecycle.
Familiarity with AI/ML and LLM-based tools (e.g., OpenAI, Hugging Face, Azure OpenAI) for observability or automation use cases.
Experience with CI/CD pipelines, GitOps, and secure deployment practices.
Programming/scripting skills in Python, Go, or Bash.
Strong understanding of SRE principles: SLAs, SLIs, SLOs, error budgets, and incident management.

Preferred Qualifications:

Experience building AI-enabled runbooks or copilots.
Exposure to FinOps or cost-optimization strategies in cloud environments.
Knowledge of distributed tracing and event correlation using OpenTelemetry.
Familiarity with Kafka, Pub/Sub, or other messaging systems for observability data.
