Sr. Site Reliability Engineer

ASK IT Solutions

Job Location:

Phoenix, AZ - USA

Monthly Salary: Not Disclosed
Posted on: 2 hours ago
Vacancies: 1 Vacancy

Job Summary

Site Reliability Engineer

Location: Phoenix, AZ

Seeking a Site Reliability Engineer (SRE) to join the Cloud Operations and Observability team. You'll be instrumental in driving resiliency, performance, automation, and AI-driven observability across hybrid cloud environments (Azure and GCP). You will design, implement, and manage infrastructure with a strong focus on Kubernetes and on integrating AI/LLM solutions into observability and operational workflows.

Key Responsibilities:

  • Build and operate scalable, secure, and highly available infrastructure in Azure and GCP.
  • Design and maintain observability platforms leveraging Splunk, OpenTelemetry, and cloud-native monitoring tools.
  • Develop and support AI/LLM-driven automation solutions to improve incident triage, alert correlation, and root cause analysis.
  • Partner with application and data teams to define SLOs, SLIs, and error budgets.
  • Drive operational excellence through automation, chaos testing, and proactive reliability improvements.
  • Optimize Kubernetes environments (GKE/AKS) for performance, security, and cost-efficiency.
  • Integrate observability data pipelines with LLMs for anomaly detection, summarization, and proactive remediation.
  • Participate in on-call rotations, incident response, and postmortem reviews.
  • Implement runbooks, auto-remediation scripts, and AI copilots for operations.

Required Qualifications:

  • 8 years of experience as an SRE.
  • Strong expertise in Azure and GCP cloud platforms (certifications a plus).
  • Proficient in Splunk (Enterprise, Observability) for monitoring, alerting, and log analytics.
  • In-depth knowledge of Kubernetes (AKS, GKE), Helm, and container lifecycle.
  • Familiarity with AI/ML and LLM-based tools (e.g., OpenAI, Hugging Face, Azure OpenAI) for observability or automation use cases.
  • Experience with CI/CD pipelines, GitOps, and secure deployment practices.
  • Programming/scripting skills in Python, Go, or Bash.
  • Strong understanding of SRE principles: SLAs, SLIs, SLOs, error budgets, and incident management.

Preferred Qualifications:

  • Experience building AI-enabled runbooks or copilots.
  • Exposure to FinOps or cost-optimization strategies in cloud environments.
  • Knowledge of distributed tracing and event correlation using OpenTelemetry.
  • Familiarity with Kafka, Pub/Sub, or other messaging systems for observability data.

Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting