DevOps Engineer
Job Location: Dallas, TX - USA
Monthly Salary: Not Disclosed
Posted on: 2 days ago
Vacancies: 1
Job Summary
Job Title: DevOps Engineer
Location: Dallas, TX
Employment Type: W2 only (no C2C)
Job Summary:
- We are seeking an experienced Site Reliability Engineer (SRE) / DevOps Engineer with a strong background in Incident Management, Change Control, Error Budgeting, Remediation, and Production Operations.
- The ideal candidate will be responsible for ensuring the reliability, scalability, performance, and operational excellence of cloud-native platforms and distributed systems.
- This role requires deep expertise in cloud infrastructure, automation, observability, incident response, and operational governance.
Key Responsibilities:
- Manage and improve platform reliability, availability, and performance across production environments.
- Lead and participate in incident management, root cause analysis, remediation planning, and post-incident reviews.
- Drive change control processes and ensure operational governance standards are followed.
- Monitor and manage error budgets while implementing reliability improvements.
- Design, build, and maintain scalable cloud infrastructure and automation frameworks.
- Deploy and manage containerized applications using Kubernetes and Docker.
- Develop and maintain CI/CD pipelines to support efficient software delivery.
- Implement Infrastructure as Code (IaC) solutions for automated provisioning and configuration management.
- Establish observability strategies using monitoring, logging, and alerting platforms.
- Collaborate with development, infrastructure, security, and business teams to ensure platform stability.
- Troubleshoot complex production issues across cloud, networking, infrastructure, and application layers.
- Continuously improve operational processes, automation, and system resilience.
Required Skills:
- 7 years of experience in Site Reliability Engineering (SRE), DevOps, Cloud Infrastructure, or Production Operations.
- Strong experience managing workloads in cloud environments:
  - Microsoft Azure
  - Amazon Web Services (AWS)
  - Google Cloud Platform (GCP)
- Hands-on experience with:
  - Kubernetes
  - Docker
  - CI/CD Pipelines
  - Infrastructure as Code (IaC)
- Strong scripting and automation expertise using:
  - Python
  - Bash
  - PowerShell
  - Go (Golang)
- Experience with observability and monitoring platforms:
  - Datadog
  - Grafana
  - Prometheus
  - Splunk
- Strong understanding of:
  - Networking concepts
  - Linux Administration
  - Windows Administration
  - Distributed Systems
  - Cloud-Native Architectures
- Experience with:
  - Incident Response
  - Production Troubleshooting
  - Operational Governance
Preferred Qualifications:
- Experience implementing reliability engineering best practices and SRE methodologies.
- Experience supporting large-scale enterprise production environments.
- Familiarity with high-availability and disaster recovery architectures.
- Experience automating operational workflows and infrastructure management.
- Knowledge of security best practices within cloud environments.
- Experience working in Agile and DevOps-driven organizations.
Soft Skills:
- Excellent verbal and written communication skills.
- Strong analytical and problem-solving capabilities.
- Ability to perform effectively during high-severity production incidents.
- Strong stakeholder management and cross-functional collaboration skills.
- Ability to prioritize multiple tasks in a fast-paced environment.
- Proactive mindset focused on continuous improvement and operational excellence.
Mandatory Skills:
- Site Reliability Engineering (SRE)
- Incident Management
- Change Control
- Error Budgeting
- Production Remediation
- Microsoft Azure / AWS / GCP
- Kubernetes
- Docker
- CI/CD Pipelines
- Infrastructure as Code (IaC)
- Python
- Bash
- PowerShell
- Go (Golang)
- Datadog
- Grafana
- Prometheus
- Splunk
- Linux Administration
- Windows Administration
- Networking
- Distributed Systems
- Cloud-Native Architectures
- Production Troubleshooting
- Operational Governance
Best regards,
Monika G
Phone: 1-
Email: