Site Reliability Engineer
Location: Phoenix, AZ
We are seeking a Site Reliability Engineer (SRE) to join the Cloud Operations and Observability team. You'll be instrumental in driving resiliency, performance, automation, and AI-driven observability across hybrid cloud environments (Azure and GCP). You will design, implement, and manage infrastructure, with a strong focus on Kubernetes and on integrating AI/LLM solutions into observability and operational workflows.
Key Responsibilities:
- Build and operate scalable, secure, and highly available infrastructure in Azure and GCP.
- Design and maintain observability platforms leveraging Splunk, OpenTelemetry, and cloud-native monitoring tools.
- Develop and support AI/LLM-driven automation solutions to improve incident triage, alert correlation, and root cause analysis.
- Partner with application and data teams to define SLOs, SLIs, and error budgets.
- Drive operational excellence through automation, chaos testing, and proactive reliability improvements.
- Optimize Kubernetes environments (GKE/AKS) for performance, security, and cost-efficiency.
- Integrate observability data pipelines with LLMs for anomaly detection, summarization, and proactive remediation.
- Participate in on-call rotations, incident response, and postmortem reviews.
- Implement runbooks, auto-remediation scripts, and AI copilots for operations.
Required Qualifications:
- 8 years of experience as an SRE.
- Strong expertise in Azure and GCP cloud platforms (certifications a plus).
- Proficient in Splunk (Enterprise, Observability) for monitoring, alerting, and log analytics.
- In-depth knowledge of Kubernetes (AKS, GKE), Helm, and container lifecycle.
- Familiarity with AI/ML and LLM-based tools (e.g., OpenAI, Hugging Face, Azure OpenAI) for observability or automation use cases.
- Experience with CI/CD pipelines, GitOps, and secure deployment practices.
- Programming/scripting skills in Python, Go, or Bash.
- Strong understanding of SRE principles: SLAs, SLIs, SLOs, error budgets, and incident management.
Preferred Qualifications:
- Experience building AI-enabled runbooks or copilots.
- Exposure to FinOps or cost-optimization strategies in cloud environments.
- Knowledge of distributed tracing and event correlation using OpenTelemetry.
- Familiarity with Kafka, Pub/Sub, or other messaging systems for observability data.