Who we are
We're a leading global security authority that's disrupting our own category. Our encryption is trusted by the major ecommerce brands, the world's largest companies, the major cloud providers, entire country financial systems, entire internets of things, and even down to the little things like surgically embedded pacemakers. We help companies put trust, an abstract idea, to work. That's digital trust for the real world.
Job Summary
The Site Reliability Engineer (SRE) collaborates with development teams to embed reliability, scalability, and performance best practices throughout the software development lifecycle. This role bridges software engineering and cloud operations, ensuring mission-critical systems remain highly available and resilient. By integrating reliability early, the SRE fosters a culture of shared responsibility while enabling rapid and safe feature delivery.
What you will do
- Design and build fault-tolerant, high-performing systems that meet Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
- Implement monitoring, alerting, distributed tracing, and logging to ensure real-time system health visibility and proactive issue resolution.
- Act as a first responder for production incidents, conduct blameless postmortems, and drive root cause analysis (RCA) and corrective actions.
- Develop self-healing automated deployments and scaling solutions to minimize toil and improve system efficiency.
- Improve continuous integration and deployment pipelines to enable safe, rapid, and reliable feature rollouts.
- Review code, debug issues, and perform quality assurance (QA) on software components to enhance system reliability and performance.
- Work closely with development teams to ensure best practices in software architecture, coding standards, and operational readiness.
- Forecast scalability needs and optimize cloud infrastructure costs while balancing performance and efficiency.
- Ensure production environments meet security and compliance requirements, collaborating with teams to mitigate vulnerabilities and enforce best practices.
- Work closely with development teams to embed reliability at every stage rather than treating it as an afterthought.
- Use error budgets to balance feature velocity with system stability.
- Implement observability and automation-first principles to measure system health and drive continuous improvement.
- Leverage game days, chaos engineering, and resilience testing to validate system robustness and refine operational processes.
What you will have
- 3-5 years of extensive experience in distributed systems, cloud-native architectures (AWS, GCP, Azure), and DevOps practices.
- Proficiency in Kubernetes, Terraform, CI/CD pipelines, and Infrastructure as Code (IaC).
- Strong scripting and automation skills in Python, Go, Bash, or similar languages.
- Expertise in observability tools such as Prometheus, Grafana, Datadog, Splunk, New Relic, and OpenTelemetry.
- Ability to troubleshoot complex production issues and drive scalable, resilient solutions.
- Experience reviewing code, debugging applications, and conducting software testing to ensure high reliability and quality.
Benefits
- Generous time-off policies
- Top-shelf benefits
- Education, wellness, and lifestyle support