Hiring: W2 Candidates Only
Visa: Open to any visa type with valid work authorization in the USA
We are seeking a highly skilled Site Reliability Engineer (SRE) to build, scale, and maintain our production infrastructure. The ideal candidate blends software engineering expertise with strong operational discipline. You will ensure the reliability, availability, security, and performance of our cloud-based systems while driving automation and continuous improvement across engineering teams.
Key Responsibilities
- Design, build, and manage highly scalable and reliable infrastructure across cloud environments (AWS/Azure/GCP).
- Develop automation for deployment, monitoring, scaling, and recovery using tools such as Terraform, Ansible, Helm, or CloudFormation.
- Implement CI/CD pipelines and partner with development teams to enhance deployment velocity and operational stability.
- Monitor system performance using tools like Prometheus, Grafana, Datadog, ELK Stack, or CloudWatch.
- Perform incident response, root cause analysis (RCA), and postmortems to ensure continuous improvement.
- Build and maintain robust alerting systems and SLOs/SLIs to uphold service-level reliability targets.
- Improve system resilience through capacity planning, chaos engineering, fault-tolerance testing, and disaster recovery strategies.
- Maintain and enhance security posture, ensure compliance, and enforce operational best practices.
- Manage containers and orchestration platforms such as Docker and Kubernetes at scale.
- Collaborate with cross-functional teams to drive reliability, performance tuning, and cost optimization.
Required Skills & Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related technical field.
- 4-8 years of SRE, DevOps, or Cloud Engineering experience.
- Strong proficiency in cloud platforms: AWS, Azure, or GCP.
- Expertise with infrastructure-as-code tools (Terraform, CloudFormation, Pulumi, Ansible).
- Hands-on experience with Kubernetes, Docker, and container orchestration.
- Strong scripting/programming skills in Python, Go, Bash, or similar.
- Solid understanding of networking fundamentals (DNS, TCP/IP, load balancing, VPC).
- Experience with monitoring, log management, and observability tools.
- Strong problem-solving, debugging, and troubleshooting skills in large-scale distributed systems.
- Good communication skills and ability to work in fast-paced, collaborative environments.
Preferred Qualifications
- Experience supporting microservices-based architectures.
- Knowledge of serverless technologies (AWS Lambda, GCP Cloud Functions, Azure Functions).
- Experience with GitOps tools (ArgoCD, Flux).
- Background in security hardening, compliance, or cloud architecture.
- Familiarity with chaos engineering tools (Gremlin, LitmusChaos).
- Experience in on-call rotations with strong incident management skills.