Site Reliability Engineer (f/m/x)

Cologne - Germany

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Location: Hybrid, Cologne (Rheinauhafen), 3 days in the office, 2 remote (Tue Thu)
Team: Engineering, reports to CTO

Keep the world awake: build reliability at scale

ilert helps thousands of DevOps & IT teams detect, fix, and communicate incidents faster.

Our platform is mission-critical: customers rely on us 24/7 to keep their always-on businesses running.

As a Site Reliability Engineer at ilert, you'll own the reliability, performance, and scalability of our core platform across AWS, Kubernetes, Kafka, and more.

Tasks

Build & operate a highly available platform

Run and evolve our AWS-based infrastructure
Operate and optimize self-managed Kafka and ClickHouse clusters and our observability stack
Ensure resilience, disaster recovery, and capacity planning across the stack

Improve reliability & performance

Build and maintain SLOs, SLIs, error budgets, and observability dashboards
Debug production issues across layers (networking, Kubernetes, application, DB)
Improve performance of our ingestion pipeline

Automation & tooling

Automate operations with Terraform, Helm, Kubernetes operators, and internal tooling
Build tooling for safer deploys, blue/green rollouts, and automated verification
Strengthen incident response workflows through deep collaboration with our AI SRE agent team

Security & compliance

Implement best practices for workload isolation, secrets management, IAM, and auditability
Support our ISO 27001 posture by automating controls and hardening our infrastructure

Cross-functional impact

Partner with Backend, AI, and Product teams to design reliable services
Participate in the on-call rotation
Lead post-incident reviews and drive long-term reliability improvements

Requirements

3 years' experience as an SRE, Platform Engineer, DevOps Engineer, or Infrastructure Engineer
Strong hands-on experience with AWS, Kubernetes, Linux internals, networking, and performance tuning
Experience operating self-managed distributed systems, ideally Kafka or ClickHouse
Strong understanding of observability
Experience automating infrastructure with Terraform and CI/CD systems
Fluent English (our working language); German optional

Benefits

Product-centric - 100% focused on solving a mission-critical pain felt by every always-on business
Hybrid freedom - 2 days remote by default; gorgeous Rheinauhafen roof terrace when you're in town
Focus > meetings - We time-box syncs, favour async docs, and protect maker time
28 days off - plus public holidays
Commute perks - subsidised public transport

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

Ilert GmbH

ilert is a SaaS company for alerting, on-call management, and status pages that helps companies operate always-on services and respond faster to incidents.
