Staff, Site Reliability Engineer

San Diego, CA - USA

Salary: $183,500 - $232,500

Posted on: 30+ days ago

Vacancies: 1 Vacancy

This job posting is outdated and the position may be filled

Job Summary

Why Lytx:

We are a team of hungry, low-ego, and capable engineers who design and support our IoT infrastructure. Are you interested in Operations as Code, Infrastructure as Code, and infrastructure automation solutions? If so, keep reading.

The Site Reliability Engineering team is responsible for the availability, reliability, observability, and resilience of infrastructure and related automation across the entire fleet of on-prem servers and the organization's expanding cloud posture. This team's responsibilities are critical to the organization's business continuity. If you love crafting new solutions and building scalable cloud and on-prem infrastructure, then this role may be an excellent match for you!

What you'll get to do:

Strategic Leadership: Define and drive the strategic direction for SRE practices and reliability engineering within the organization, influencing both technical and operational strategies.
Advanced System Architecture: Architect and implement complex systems and solutions, addressing high-impact and cross-team challenges with a focus on scalability, reliability, and performance.
High-Level Incident Management: Lead major incident response efforts and postmortem analyses, ensuring thorough investigations and comprehensive resolution strategies to improve overall system resilience.
Cross-Functional Collaboration: Partner with engineering, operations, and product teams to embed reliability and performance best practices into all aspects of system design and development.
Innovation and Improvement: Drive innovation in reliability engineering practices, introducing new tools, technologies, and methodologies to enhance system performance and operational efficiency.
Strategic Capacity Planning: Oversee long-term capacity planning and forecasting, aligning resource allocation with business goals and scaling needs to ensure continuous service reliability.
Mentorship and Leadership: Provide guidance and mentorship to senior and junior SREs, fostering a culture of learning and professional development within the SRE team.
Organizational Impact: Contribute to and influence organizational policies, procedures, and best practices related to system reliability, ensuring alignment with broader business objectives and industry standards.

What you'll need:

8 years of experience as an SRE in AWS environments within medium to large-scale organizations.
8 years of hands-on experience with observability tools, including Prometheus, New Relic, Grafana, or similar.
Exceptional proficiency in programming, with expertise in Python, Go, PowerShell, YAML, and Bash.
Extensive experience managing database technologies, both SQL and NoSQL.
5 years of experience in designing and building infrastructure deployment pipelines using Git, GHA, Terraform, Helm, or similar tools.
Advanced expertise in designing and managing production environments in AWS, including services such as VPCs, EKS, IAM, AMI, EC2, CloudWatch, CloudTrail, Control Tower, GuardDuty, MSK, S3, Glacier, Gateways, Direct Connect, Route 53, RDS, ALBs, Autoscaling, and more.
Deep knowledge of Linux systems and a range of protocols and technologies, including HTTP, REST, TCP/IP, SSL, DNS, SMTP, SSH, NTP, Load Balancing, SQL/NoSQL, Message Brokers, Nginx, Vault, ELK, and others.
Expert-level experience with Kubernetes and a variety of container and cloud-native technologies.
Proven ability to manage 24/7 on-call rotations, develop runbooks, establish support procedures, and proactively monitor systems across multiple geographic locations.
Ability to excel under pressure in complex, high-stakes environments.

Benefits:

Medical, dental, and vision insurance
Health Savings Account
Flexible Spending Accounts
Telehealth
401(k) and 401(k) match
Life and AD&D insurance
Short-Term and Long-Term Disability
FTO or PTO
Employee Well-Being program
11 paid holidays plus 1 inclusive holiday per year
Volunteer Time Off
Employee Referral program
Education Reimbursement Program
Employee Recognition and Appreciation program
Additional perk and voluntary benefit programs

Salary is based on a number of factors, including market location, and may vary depending on job-related knowledge, skills, and experience. This position is also eligible for an incentive compensation plan. The expected hiring salary for this position is:

$183,500.00 - $232,500.00

Innovation Lives Here

Together we help save lives on our roadways!

Lytx Inc. is proud to be an equal opportunity employer. We're committed to building a diverse and inclusive workforce and do not discriminate based on race, color, religion, sex, sexual orientation, gender identity or expression, gender, genetic information, uniformed service, national origin, age, veteran status, disability, pregnancy, or any other status protected by federal or state law. We are committed to providing reasonable accommodation for candidates with disabilities who need assistance during the hiring process. To request a reasonable accommodation, please email . Lytx conducts background checks on applicants who receive a conditional offer of employment, in accordance with applicable local, state, federal, and regional laws. Qualified applicants with arrest or conviction records will be considered. Background check results may potentially result in the withdrawal of a conditional offer of employment, and any such decision will be made in accordance with all applicable local, state, federal, and regional laws.
