Title: Sr. SRE / DevOps Engineer
Location: Sunnyvale, CA - Onsite
Job Description:
For this role, we are looking for a Sr. SRE / DevOps Engineer based at the Sunnyvale, California location.
As a Site Reliability Engineer, the individual will work closely with multi-functional teams, automate operations, optimize infrastructure, implement security, and solve issues in an exciting, fast-paced environment. The individual will play a vital role in ensuring that the systems are reliable, scalable, and high-performing.
Technical Skills:
- DevOps and SRE
- AWS, Kubernetes/EKS, Docker
- Terraform, Ansible, or CloudFormation
- Splunk, Apache Flink
- Programming/Scripting using Java or Python
- CI/CD
- Databases: Vertica, Snowflake
Responsibilities:
- Ensure system reliability and availability: monitor system issues, create strategies to detect and address them, design automated systems for troubleshooting, and write and review post-mortems.
- Mitigate operational risks: collaborate with development teams and other stakeholders to identify potential risks, perform risk assessments, implement risk mitigation strategies, and continuously monitor and review the effectiveness of those strategies.
- Monitor system health.
- Minimize emergency response time (MTTR).
- Maintain CI/CD pipelines.
- Drive continuous improvement by collaborating with various teams.
- Automate processes.
Must-have/required experience and skills:
- 8 years of experience in DevOps and Site Reliability Engineering.
- Hands-on experience with containerization and orchestration: Docker, Kubernetes/EKS.
- Proficiency in infrastructure-as-code tools: Terraform, Ansible, or CloudFormation.
- Experience setting up and managing services running on Kubernetes.
- In-depth understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and automation.
- In-depth knowledge of monitoring and observability tools: Splunk.
- Knowledge of Linux operating system principles, networking fundamentals, and systems management.
- Demonstrable fluency in at least one of the following languages: Java or Python
- Ability to identify and communicate technical and architectural problems while working with partners and their team to iteratively find solutions.
- Experience building and managing CI/CD pipelines, gatekeeping production deployments, developing and implementing Git branching strategies, branch protection rules, and network policies, and scaling load up and down on AWS.
- Strong problem-solving and analytical skills.
- Ability to solve performance and scalability issues in the system.
Behavioral Skills:
- Excellent communication and collaboration skills.
- Ability to propose and implement improvements in the system
- Ability to work with cross-functional stakeholders
- Adaptability and a willingness to learn new technologies and techniques.
- Proactive approach to issues, with the ability to provide prompt resolutions or workarounds.