Hiring: W2 Candidates Only
Visa: Open to any visa type with valid work authorization in the USA
Summary
A Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of software systems and infrastructure. This role bridges the gap between development and operations by applying software engineering principles to IT operations, automating processes, and monitoring system health to prevent downtime and improve system efficiency.
Key Responsibilities
- Design, implement, and maintain reliable, scalable, and highly available infrastructure and services.
- Monitor system performance, availability, and capacity; respond proactively to incidents and outages.
- Develop and maintain automation tools for deployment, monitoring, and infrastructure management.
- Collaborate with software engineers to design systems with reliability and maintainability in mind.
- Troubleshoot, debug, and resolve complex production issues across multiple systems and services.
- Implement and maintain CI/CD pipelines, configuration management, and version control best practices.
- Conduct post-incident reviews, identify root causes, and implement corrective actions to prevent recurrence.
- Define and enforce service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
- Optimize system performance, cost, and resource utilization through analysis and continuous improvement.
- Document infrastructure, operational procedures, incident reports, and monitoring configurations.
- Mentor junior engineers and promote best practices for reliability, automation, and observability.
- Stay current with emerging technologies and DevOps practices to improve operational excellence.
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 3-6 years of experience in site reliability engineering, DevOps, or system administration.
- Strong understanding of Linux/Unix systems, networking, and cloud platforms (AWS, Azure, GCP).
- Proficiency in scripting and programming languages such as Python, Bash, Go, or Java.
- Experience with monitoring, logging, and observability tools (Prometheus, Grafana, ELK Stack).
- Familiarity with containerization and orchestration tools (Docker, Kubernetes).
Preferred Skills / Duties
- Experience with Infrastructure as Code (Terraform, Ansible, CloudFormation).
- Knowledge of CI/CD tools and pipelines (Jenkins, GitLab, CircleCI).
- Understanding of distributed systems, microservices architecture, and high-availability systems.
- Strong problem-solving, analytical, and communication skills.
- Ability to implement security best practices in operational environments.
- Experience in automating repetitive operational tasks and improving system reliability.