Site Reliability Engineer
Deerfield, IL (Onsite)
Full-time
Job Description
Must Have Technical/Functional Skills
- 7 years of experience in SRE, platform engineering, or cloud infrastructure engineering in large-scale enterprise environments (10,000 employees or equivalent complexity).
- Deep hands-on expertise with Microsoft Azure, with a minimum of 4 years in a primary Azure cloud engineering role.
- Expert-level proficiency with AKS: cluster lifecycle management, RBAC, network policies, pod security standards, cluster autoscaler, and Workload Identity.
- Strong infrastructure-as-code skills: Terraform (required) and/or Bicep; experience managing Azure Landing Zones or Enterprise-Scale architecture.
- Proficiency in at least one systems programming/scripting language: Python (preferred), Go, or PowerShell.
- Experience designing and operating enterprise observability platforms using Azure Monitor, Log Analytics, and Application Insights at scale.
- Demonstrable track record of owning SLOs/SLIs and delivering measurable reliability improvements in production.
- Strong knowledge of enterprise networking in Azure: Hub-and-Spoke/Virtual WAN, ExpressRoute, Azure Firewall, NSGs, Private Endpoints, and Private DNS Zones.
Required/Preferred Certifications:
- AZ-104, AZ-305 (Preferred), AZ-400 (Preferred), CKA, ITIL v4 Foundation
Roles & Responsibilities
Reliability & Availability Engineering
- Define, own, and enforce enterprise-wide SLOs, SLIs, and Error Budgets across all Tier-0 and Tier-1 Azure-hosted services; report SLA compliance to executive stakeholders monthly.
- Lead architectural reviews for new services and ensure non-functional reliability requirements (availability targets, RTO/RPO) are embedded from design through to production.
- Champion and implement chaos engineering practices using Azure Chaos Studio and custom fault injection frameworks to proactively surface reliability risks.
- Drive Disaster Recovery (DR) design and conduct quarterly DR drills across Azure paired regions.
Incident Management & On-Call
- Serve as Incident Commander for P1/P2 major incidents; own the end-to-end incident lifecycle from detection through resolution and Post-Incident Review (PIR).
- Participate in a structured On-Call rotation with follow-the-sun global coverage; maintain response SLAs of < 5 minutes for Tier-0 services.
- Drive blameless post-mortem culture and ensure all action items from PIRs are tracked and delivered within agreed SLA.
Observability & Platform Engineering
- Design and operate the enterprise observability stack: Azure Monitor, Log Analytics Workspaces, Application Insights, and Azure Managed Grafana; ensure full MELT (Metrics, Events, Logs, Traces) coverage.
- Build and maintain alerting frameworks using Azure Monitor Alert Rules and Azure Action Groups integrated with PagerDuty and ServiceNow.
- Develop and operate platform automation, runbooks, and self-healing capabilities using Azure Automation, Logic Apps, and Python/PowerShell scripting.
CI/CD & Infrastructure Reliability
- Collaborate with DevOps and development teams to embed reliability gates into Azure DevOps pipelines: automated performance testing, synthetic monitoring, and progressive deployment (canary/blue-green) strategies.
- Manage reliability of AKS clusters across multiple Azure regions; own node pool scaling, upgrade strategy, and cluster hardening in alignment with CIS Benchmarks.
- Contribute to infrastructure-as-code reliability reviews using Terraform/Bicep to enforce standards across Azure Landing Zones.
Thanks & Regards
Chandragupt Shivam
Ph: EXT-3188
Synchrony Systems Inc.
Disclaimer: We respect your online privacy. If you would like to be removed from our mailing list, please reply with Remove in the subject and we will comply immediately. We apologize for any inconvenience caused. Please let us know if you have more than one domain. The material in this e-mail is intended only for the use of the individual to whom it is addressed and may contain information that is confidential, privileged, and exempt from disclosure under applicable law. If you are not the intended recipient, be advised that the unauthorized use, disclosure, copying, distribution, or the taking of any action in reliance on this information is strictly prohibited. We are an equal opportunity employer with a diverse workforce. Note: Any resume submitted by Synchrony Systems Inc. is presented with the understanding that the candidate is being considered for your direct end-client (end-client is the company where the work will be performed). If there is any other company involved between the end-client and your company, please do not submit this resume without our written approval. If you submit the resume to another third party, Synchrony Systems Inc. reserves the right to work with the third party directly.