Staff Site Reliability Engineer
Job Summary
Why Lytx Staff Site Reliability Engineer
At Lytx, our engineering culture is built around being hungry, low-ego, and highly capable. We are pragmatic engineers who take ownership, collaborate openly, and focus on delivering measurable operational impact. Our mission is to design, operate, and continuously improve the cloud infrastructure and operational platforms that power mission-critical SaaS and IoT services at scale.
As our platform grows in scale and complexity, we are investing in next-generation observability, intelligent automation, and data-driven operations to improve reliability, reduce operational noise, and enable faster detection and recovery. We are also expanding the use of AI and advanced analytics to move toward more proactive and automated operations.
The Site Reliability Engineering (SRE) team is responsible for the availability, reliability, observability, and resilience of our cloud-native environments. This includes building automation, improving operational intelligence, and partnering across engineering to ensure systems are designed and operated for reliability and scale.
As a Staff SRE, you will operate as a technical leader across multiple teams and services. You will drive reliability initiatives, influence architecture and operational practices, and lead efforts that reduce operational risk, improve system visibility, and increase the effectiveness of engineering through automation and intelligent operations.
If you enjoy solving complex distributed systems challenges, building scalable operational solutions, and leading improvements that have broad impact across the organization, this role is an excellent fit.
Responsibilities / You'll get to
Technical Leadership Across Services - Lead reliability, performance, and operational improvements across multiple services or platform domains, working with engineering teams to ensure systems meet availability and scalability goals.
Observability Architecture & Strategy (Team-Level) - Design and drive improvements to monitoring, logging, tracing, and alerting. Establish patterns and reusable solutions that improve signal quality, reduce alert noise, and enable faster detection and diagnosis.
Operational Automation & AIOps - Lead initiatives that reduce operational toil through automation, including runbook automation, self-healing workflows, event correlation, anomaly detection, and automated remediation.
Incident Leadership & Systemic Improvement - Provide technical leadership during high-severity incidents and drive blameless postmortems that identify systemic issues and result in durable reliability improvements.
Reliability Engineering & Resilience - Partner with product and platform teams to embed reliability, performance, and fault tolerance into system design, including capacity planning, scaling strategies, and failure-mode analysis.
Infrastructure & Cloud Engineering - Design and implement scalable AWS infrastructure using Infrastructure-as-Code and cloud-native best practices, enabling consistent and reliable service operation.
Standards & Best Practices Influence - Influence SRE practices such as SLO/SLI design, alerting standards, operational readiness, and reliability reviews. Contribute to evolving operational standards and engineering guidelines.
Cross-Functional Collaboration - Work closely with developers, platform engineers, architects, and operations teams to drive reliability, observability, and operational maturity across the engineering organization.
Mentorship & Technical Guidance - Mentor Senior and mid-level engineers, provide technical guidance, and act as a subject matter expert for reliability, observability, and operational excellence.
Innovation & Tooling Evaluation - Evaluate and introduce new tools, AWS-native capabilities, and emerging observability or AI-enabled operational technologies that improve reliability and engineering efficiency.
Requirements / You'll Need
Experience
- 6-8 years of experience in SRE, DevOps, platform engineering, or cloud infrastructure roles supporting large-scale production environments.
- Demonstrated experience leading reliability or infrastructure initiatives across multiple teams or services.
- Strong experience operating 24/7 production systems, including incident leadership, root cause analysis, and proactive reliability improvement.
Cloud & Infrastructure
- Deep hands-on experience designing and operating production workloads in AWS, including services such as EC2, EKS/ECS, RDS/DynamoDB, S3, ALB/NLB, VPC, IAM, and CloudWatch.
- Strong experience building and managing infrastructure using Terraform, CloudFormation, or similar Infrastructure-as-Code tools.
Observability
- Strong experience designing and implementing observability solutions using tools such as New Relic, Datadog, Prometheus/Grafana, CloudWatch, or similar.
- Experience with OpenTelemetry or modern telemetry standards.
- Experience improving telemetry quality, alert tuning, dashboard design, and operational visibility across multiple services.
Automation & Engineering
- Strong programming or scripting skills (Python, Go, Bash, or similar) for building automation, operational tooling, and integrations.
- Experience building reusable automation frameworks or shared tooling preferred.
Systems & Platform Expertise
- Strong understanding of Linux systems, networking fundamentals (TCP/IP, DNS, TLS), and distributed system behavior.
- Experience with Kubernetes and cloud-native architectures.
Operational Intelligence (Preferred)
- Experience improving operational signal quality through alert noise reduction, event correlation, anomaly detection, or automated remediation.
- Experience with AIOps concepts or AI-assisted operational tooling.
Leadership & Influence
- Demonstrated ability to influence technical decisions without direct authority.
- Experience mentoring engineers and driving cross-team technical initiatives.
- Ability to operate effectively in complex, high-impact production environments.
You're driven to succeed, and so are we. At Lytx, our mission is to protect a world in motion, and we do it by building technology and partnerships that help keep people safe on the road. The way we work is guided by our shared values: Deliver for the customer, Responsibility in every outcome, Innovate with purpose, Velocity with excellence, and Elevate each other.
If you're looking for meaningful work, a team that challenges and supports you, and the chance to grow your career while making a real impact, we'd love to meet you.
Together, we're helping make roadways safer and saving lives!
Lytx Inc. is proud to be an equal opportunity employer. We're committed to building a diverse and inclusive workforce and do not discriminate based on race, color, religion, sex, sexual orientation, gender identity or expression, gender, genetic information, uniformed service, national origin, age, veteran status, disability, pregnancy, or any other status protected by federal or state law. We are committed to providing reasonable accommodation for candidates with disabilities who need assistance during the hiring process. To request a reasonable accommodation, please email . Lytx conducts background checks on applicants who receive a conditional offer of employment in accordance with applicable local, state, federal, and regional laws. Qualified applicants with arrest or conviction records will be considered. Background check results may potentially result in the withdrawal of a conditional offer of employment and will be made in accordance with all applicable local, state, federal, and regional laws.
Required Experience:
Staff IC
About Company
Since 1998, Lytx has led the video telematics industry using proprietary machine vision, artificial intelligence, and big data to protect and connect thousands of fleets and millions of drivers in more than 85 countries worldwide. At Lytx, you'll be a part of something good - helping sav ...