Senior SaaS Platform Reliability Engineer

Once For All UK

Job Location:

Hampshire - UK

Salary: Not Disclosed
Posted on: 4 hours ago
Vacancies: 1 Vacancy

Job Summary

Once For All is a high-growth, cloud-based SaaS subscription business helping organisations manage supply chain governance, risk management and compliance. We support over 250,000 customers across the UK, in more than 20 public and private sector industries, including construction, transport, retail, hospitality, education, facilities management, manufacturing, and central and local government.

Role Summary:

Join our engineering team as a Senior SaaS Platform Reliability Engineer, taking a foundational role in establishing and maturing reliability engineering practices across our SaaS platform.

This is not an advisory or oversight-only role. You will personally contribute to reliability improvements, working in partnership with our Scrum-based product teams and our DevOps and Cloud engineering teams to ensure those improvements are delivered safely into production.

You will help define SLIs, SLOs and error budgets for tier-1 customer-facing services and contribute to how reliability trade-offs are made and communicated. The role focuses on improving and evolving existing systems (observability, alerting, performance tooling, automation and testing), rebuilding components where necessary to better support SLO-driven reliability.

This role is fully remote, working within UK time zones.

What Hands-On Means in This Role

To be explicit:

  • This is not a strategy-only role. You will personally contribute reliability improvements, working alongside product and platform teams to see them delivered into production.

  • Expect to spend 60-80% of your time building and operating, including:

    • Instrumentation, dashboards and alerting

    • Automation to reduce operational toil

    • Release safety mechanisms and guardrails

  • You will be an active on-call contributor for tier-1 services and lead incident response end-to-end (triage, mitigation, permanent fix, postmortem).

  • You will write production code (not only infrastructure-as-code) to improve reliability, availability and performance.

  • You will build on and improve existing observability, alerting, performance, automation and test systems, rebuilding components where necessary rather than starting from scratch.

  • Success in this role is measured by outcomes such as:

    • Clear and trusted SLO reporting

    • Reduced alert noise

    • Improved p95 / p99 latency

    • Fewer production regressions and rollbacks

Job Responsibilities:

Reliability & SLO Ownership

  • Define user-centric SLIs and SLOs for critical customer journeys.

  • Define and maintain error budgets and work with sprint teams to guide how error budgets are understood and spent.

  • Provide visibility and expertise to support reliability-focused trade-offs while sprint teams remain accountable for delivery.

  • Own reliability outcomes for tier-1 customer-facing services.
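
For readers less familiar with the terminology, an error budget is simply the amount of failure an SLO permits over a window. The sketch below is illustrative only and is not taken from Once For All's tooling: the `ErrorBudget` class, its field names and the sample numbers are all hypothetical, and it assumes a simple request-based availability SLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO allows to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0
        return 1.0 - self.failed_requests / allowed


# Made-up numbers: a 99.9% SLO over one million requests with 250 failures.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"budget remaining: {budget.remaining_fraction:.0%}")
```

A figure like the remaining fraction is what lets a sprint team decide whether to keep shipping features or pause for reliability work.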

Observability & Alerting

  • Evaluate, design and improve end-to-end observability across metrics, logs and traces.

  • Improve existing alerting to focus on customer impact rather than infrastructure noise.

  • Continuously refine signals to improve detection, reduce false positives and minimise alert fatigue.

Incident Response & Learning

  • Actively participate in and lead major incident response when required.

  • Run blameless postmortems focused on systemic improvement.

  • Track and improve MTTR, MTTD and incident recurrence over time.

Automation & Toil Reduction

  • Identify operational toil and reduce it through engineering and automation.

  • Improve and extend existing automation and operational tooling.

  • Treat reliability issues as software problems, not process gaps.

Availability, Performance & Scalability

  • Contribute to zone- and region-aware architectures on Azure and AKS.

  • Perform capacity planning and load testing aligned to SLOs.

  • Improve p95 and p99 latency, throughput and behaviour under failure conditions.
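
As a rough illustration of the p95 and p99 figures mentioned above, the following sketch computes tail-latency percentiles from raw samples using Python's standard library. The function name `latency_percentiles` and the sample data are hypothetical; a production system would normally read these values from its monitoring backend rather than compute them by hand.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return p50, p95 and p99 for request latencies given in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Made-up sample: mostly fast requests with a slow tail.
samples = [20.0] * 90 + [200.0] * 8 + [1500.0] * 2
result = latency_percentiles(samples)
print(result)
```

Percentiles are used instead of the mean because a small number of very slow requests can leave the average looking healthy while a meaningful share of customers see poor performance.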

Release Safety & Regression Prevention

  • Improve safe release mechanisms, including canary, blue/green and progressive delivery.

  • Strengthen detection and rollback strategies for faulty releases.

  • Partner with teams to improve test coverage where it directly protects production reliability.

Platform & Infrastructure

  • Contribute to infrastructure as code using Terraform or Bicep.

  • Help enforce standards and guardrails via policy as code.

  • Support platform security through managed identity, secrets management, least privilege and network segmentation.

Coaching & Influence

  • Raise reliability standards across engineering teams through hands-on collaboration and guidance.

  • Help teams adopt SLOs, observability and safer operational practices pragmatically.

What This Role Is and Is Not

This role is:

A senior, hands-on reliability engineering role

Focused on improving production outcomes through engineering

Collaborative, working through and with existing teams

This role is not:

A pure DevOps or cloud infrastructure role

A people-management position

A greenfield rebuild of all systems

An expectation to fix everything at once

Candidate Requirements

Essential Experience:

Proven experience contributing to production reliability for customer-facing SaaS systems.

Hands-on experience defining and operating with SLIs, SLOs and error budgets in collaboration with delivery teams.

Strong experience leading or coordinating incident response and delivering permanent fixes.

Demonstrated ability to improve observability and alerting to reduce noise and improve recovery.

Experience reducing operational toil through automation and software engineering.

Technical Skills

Strong experience with Microsoft Azure, including AKS, Front Door or Application Gateway, VNets, Private Link, Key Vault, Azure Monitor, Log Analytics and Application Insights.

Production Kubernetes experience on AKS, including networking, PodDisruptionBudgets, autoscaling (HPA/VPA), node pools, upgrades and failure scenarios.

Infrastructure as Code using Terraform or Bicep, with Git-based workflows.

CI/CD experience supporting progressive delivery and automated rollback.

Software development and automation using Python or Go, plus Bash or PowerShell.

Experience tuning latency, error rates and throughput under real production load.

Practical experience designing and testing backup, restore and disaster recovery with clear RPO and RTO.

Important: We value real-world reliability problem solving over titles or years of experience. Candidates should be able to describe specific production incidents they contributed to and the lasting improvements they delivered.

Nice to Have

Experience with multi-tenant SaaS reliability and data isolation.

Service mesh, eBPF or advanced traffic management.

FinOps practices tied to reliability (e.g. cost per request or per tenant).

Experience supporting compliance-driven availability and audit requirements.

What We Offer

  • Health & Wellbeing: Private Medical Insurance or wellness fund, 24/7 Employee Assistance Programme.

  • Financial Benefits: Pension, Life Assurance (3× salary).

  • Time Off: 25 days holiday, 8 bank holidays, holiday purchase scheme (5 days), paid and unpaid volunteering days.

  • Growth & Development: Ongoing CPD, team offsites and company events.

  • Everyday Perks: Home office budget, high-spec laptop and peripherals.

  • Work Setup: Fully remote within UK time zones, with optional access to our Basingstoke office.

Interview Process

  1. Intro & role overview with Talent.

  2. Technical deep dive covering reliability trade-offs, Azure, AKS and real incident experience.

  3. Practical exercise:

    • Define SLIs/SLOs for a sample service

    • Propose an alerting strategy focused on customer impact

    • Design a safe release and rollback plan

  4. Collaboration & influence interview with Engineering.

Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting