Job Title: DevOps Engineer
Location: Atlanta, GA; Birmingham, AL; Louisville, KY; Richmond, VA; Charlotte, NC
W2 only; no C2C.
Job Summary:
We are seeking an experienced Site Reliability Engineer (SRE) / DevOps Engineer with a strong background in Incident Management, Change Control, Error Budgeting, Remediation, and Production Operations. The ideal candidate will be responsible for ensuring the reliability, scalability, performance, and operational excellence of cloud-native platforms and distributed systems. This role requires deep expertise in cloud infrastructure, automation, observability, incident response, and operational governance.
Key Responsibilities:
Manage and improve platform reliability, availability, and performance across production environments.
Lead and participate in incident management, root cause analysis, remediation planning, and post-incident reviews.
Drive change control processes and ensure operational governance standards are followed.
Monitor and manage error budgets while implementing reliability improvements.
Design, build, and maintain scalable cloud infrastructure and automation frameworks.
Deploy and manage containerized applications using Kubernetes and Docker.
Develop and maintain CI/CD pipelines to support efficient software delivery.
Implement Infrastructure as Code (IaC) solutions for automated provisioning and configuration management.
Establish observability strategies using monitoring, logging, and alerting platforms.
Collaborate with development, infrastructure, security, and business teams to ensure platform stability.
Troubleshoot complex production issues across cloud, networking, infrastructure, and application layers.
Continuously improve operational processes, automation, and system resilience.
Required Skills:
7 years of experience in Site Reliability Engineering (SRE), DevOps, Cloud Infrastructure, or Production Operations.
Strong experience managing workloads in cloud environments:
Microsoft Azure
Amazon Web Services (AWS)
Google Cloud Platform (GCP)
Hands-on experience with:
Kubernetes
Docker
CI/CD Pipelines
Infrastructure as Code (IaC)
Strong scripting and automation expertise using:
Python
Bash
PowerShell
Go (Golang)
Experience with observability and monitoring platforms:
Datadog
Grafana
Prometheus
Splunk
Strong understanding of:
Networking concepts
Linux Administration
Windows Administration
Distributed Systems
Cloud-Native Architectures
Experience with:
Incident Response
Production Troubleshooting
Operational Governance
Preferred Qualifications:
Experience implementing reliability engineering best practices and SRE methodologies.
Experience supporting large-scale enterprise production environments.
Familiarity with high-availability and disaster recovery architectures.
Experience automating operational workflows and infrastructure management.
Knowledge of security best practices within cloud environments.
Experience working in Agile and DevOps-driven organizations.
Mandatory Skills: Site Reliability Engineering (SRE), Incident Management, Change Control, Error Budgeting, Production Remediation, Microsoft Azure, AWS, GCP, Kubernetes, Docker, CI/CD Pipelines, Infrastructure as Code (IaC), Python, Bash, PowerShell, Go (Golang), Datadog, Grafana, Prometheus, Splunk, Linux Administration, Windows Administration, Networking, Distributed Systems, Cloud-Native Architectures, Production Troubleshooting, Operational Governance