Site Reliability Engineer, Cloud Incident Response

London - UK

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000 employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid-market firms, rely on SS&C for expertise, scale, and technology.

Job Description

Get To Know Us:

SS&C prides itself on hiring diverse, honest, dynamic individuals who value collaboration, accountability, and innovation, to name a few.

Site Reliability Engineer

Location: London office, hybrid (2 days per week onsite)

About the Role

We're seeking a hands-on Site Reliability Engineer to enhance our production reliability, scalability, and operability. You'll use your expertise across observability, Kubernetes, AWS, and infrastructure as code to investigate issues, implement tactical fixes quickly, and drive strategic improvements that raise availability and reduce toil. This is a hybrid role with two days per week in the office. You'll collaborate closely with engineering, product, and support to design, build, and run robust platforms that meet demanding SLAs/SLOs.

What You'll Do

Keep production healthy: Monitor, troubleshoot, and resolve incidents across services and infrastructure; reduce MTTR and prevent recurrences through high-quality post-incident actions.
Observability as a first-class practice: Use Grafana, Datadog, and Splunk (and related tools like Prometheus/OpenTelemetry) to detect anomalies, root-cause issues, and create actionable alerts and dashboards.
Run Kubernetes at scale: Operate and harden Kubernetes (EKS preferred); manage deployments, autoscaling, rollouts/rollbacks, service mesh/ingress, and cluster upgrades.
Build reliable cloud foundations: Design and operate AWS workloads (networking, IAM, EC2/EKS, RDS/Aurora, S3, CloudWatch, ALB/NLB, VPC, Security Groups) with a security-first mindset.
Automate with IaC: Codify and continuously improve infrastructure using Terraform (modules, workspaces, remote state, policy as code).
Enable fast, safe delivery: Partner with teams to enhance CI/CD pipelines (e.g. GitHub Actions/Jenkins/Argo CD), progressive delivery, and change management to lower the change failure rate.
Own reliability metrics: Define and iterate on SLOs/SLIs/error budgets; champion blameless postmortems and reliability reviews.
Participate in on-call: Join a fair, well-documented on-call rota; improve runbooks, automation, and alert quality to make on-call sustainable.
Drive strategic improvements: Identify systemic issues and deliver durable fixes (architecture, capacity, scaling, caching, resilience patterns, rate limiting, backpressure, circuit breakers, chaos engineering).

What you will bring

5 years operating production systems as an SRE, DevOps engineer, or software engineer.
Observability: Hands-on with Grafana, Datadog, and Splunk for incident investigation, dashboarding, alerting, tracing/logs/metrics correlation, and performance analysis.
Kubernetes: Strong experience running and troubleshooting workloads (controllers, pods, networking, storage, HPA/VPA, Helm/Kustomize).
AWS: Solid practical knowledge of core services and best practices for security, cost, and reliability.
Terraform: Confident with module design, state management, DRY patterns, and CI for IaC.
On-call experience: Demonstrated participation in a production on-call rota, effective incident communication, and post-incident follow-through.
Scripting & engineering fundamentals: Proficiency in at least one of Python, Go, or Bash; strong Linux, networking (DNS, TLS, HTTP, TCP), and Git.
Collaboration & communication: Ability to work cross-functionally, write clear runbooks/RFCs, and influence engineering practices.

Nice to Have

EKS internals, cluster autoscaler, managed node groups/Fargate; service mesh (Istio/Linkerd), ingress controllers (Nginx/ALB).
Prometheus, OpenTelemetry, Loki/Tempo, alert tuning, and SLO burn-rate alerts.
Argo CD/FluxCD, Helm chart authoring, Kustomize.
CD patterns (blue/green, canary, feature flags), GitOps workflows.
Database operations (Postgres/MySQL), caching (Redis), message queues (Kafka/SQS).
Security & compliance (CIS benchmarks, IAM boundaries, secrets management, Vault/Sealed Secrets).
Resilience testing/chaos engineering.
Relevant certs (AWS Solutions Architect/DevOps Engineer, CKA/CKAD, Terraform Associate).

How We Work

Hybrid: Two days per week in the office for collaboration and incident/architecture reviews; remote the rest.
Engineering excellence: Blameless culture, well-defined SLOs, automation-first, and continuous learning.
Impact focus: Measure success via availability, latency, MTTR, change failure rate, toil reduction, and customer outcomes.

On-Call Expectations

Participate in a rotating on-call schedule with clear escalation paths.
Improve alert signal-to-noise ratio and operational readiness (dashboards, runbooks, playbooks).
Post-incident reviews focused on learning and durable improvements, not blame.

Benefits

Competitive salary and bonus (DOE)
Pension and comprehensive benefits
Modern tooling and time allocated for reliability improvements

We encourage applications from people of all backgrounds to enable us to bring diverse perspectives to our thinking and conversation. It's important to us that we strive to have a workforce that is diverse in the widest sense.

Thank you for your interest in SS&C! If applicable, to further explore this opportunity, please apply directly with us through the Careers page on our corporate website. Unless explicitly requested or approached by SS&C Technologies, Inc. or any of its affiliated companies, the company will not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services.

SS&C Technologies is an Equal Employment Opportunity employer and does not discriminate against any applicant for employment or employee on the basis of race, color, religious creed, gender, age, marital status, sexual orientation, national origin, disability, veteran status, or any other classification protected by applicable discrimination laws.

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

SS&C

Leading cloud-based provider of financial services technology solutions. SS&C Technologies owns and maintains the best financial technology in the industry.
