Location: Hybrid, Cologne (Rheinauhafen); 3 days in the office, 2 remote (Tue Thu)
Team: Engineering. Reports to CTO
Keep the world awake, build reliability at scale
ilert helps thousands of DevOps & IT teams detect, fix, and communicate incidents faster.
Our platform is mission-critical: customers rely on us 24/7 to keep their always-on businesses running.
As a Site Reliability Engineer at ilert, you'll own the reliability, performance, and scalability of our core platform across AWS, Kubernetes, Kafka, and more.
Tasks
Build & operate a highly available platform
- Run and evolve our AWS-based infrastructure
- Operate and optimize self-managed Kafka and ClickHouse clusters and our observability stack
- Ensure resilience, disaster recovery, and capacity planning across the stack
Improve reliability & performance
- Build and maintain SLOs, SLIs, error budgets, and observability dashboards
- Debug production issues across layers (networking, Kubernetes, application, DB)
- Improve performance of our ingestion pipeline
Automation & tooling
- Automate operations with Terraform, Helm, Kubernetes operators, and internal tooling
- Build tooling for safer deploys, blue/green rollouts, and automated verification
- Strengthen incident response workflows through deep collaboration with our AI SRE agent team
Security & compliance
- Implement best practices for workload isolation, secrets management, IAM, and auditability
- Support our ISO 27001 posture by automating controls and hardening our infrastructure
Cross-functional impact
- Partner with Backend, AI, and Product teams to design reliable services
- Participate in the on-call rotation
- Lead post-incident reviews and drive long-term reliability improvements
Requirements
- 3 years' experience as an SRE, Platform Engineer, DevOps Engineer, or Infrastructure Engineer
- Strong hands-on experience with AWS, Kubernetes, Linux internals, networking, and performance tuning
- Experience operating self-managed distributed systems, ideally Kafka or ClickHouse
- Strong understanding of observability
- Experience automating infrastructure with Terraform and CI/CD systems
- Fluent English (our working language); German optional
Benefits
- Product-centric - 100% focused on solving a mission-critical pain felt by every always-on business
- Hybrid freedom - 2 days remote by default; gorgeous Rheinauhafen roof terrace when you're in town
- Focus > meetings - We time-box syncs, favour async docs, and protect maker time
- 28 days off - plus public holidays
- Commute perks - subsidised public transport