We are looking for a Senior Site Reliability Engineer to join our Site Reliability Engineering (SRE) team. In this role, you'll drive the reliability, scalability, and performance of our platform, ensuring our systems remain stable as we grow. We value innovation and are seeking someone eager to bring fresh ideas, especially around building automation that reduces manual effort and improving the resilience of our distributed systems.
This isn't a top-down organization; our engineers are the ones who flag technical challenges and design the solutions. You will collaborate closely with the Platform Engineering, Security, AI Platform, and Product teams to design durable systems and make data-driven operational decisions.
What You'll Do
Collaborate with the Engineering, Platform, and Security teams to embed SRE best practices early in system design.
Lead advancements in observability, monitoring, alerting, and incident-response workflows.
Analyze platform performance to contribute to cost optimization, performance tuning, and resilience planning.
Build infrastructure and automation tooling that improves platform reliability and enhances deployment safety.
Diagnose and resolve complex production issues across distributed systems, and drive open post-incident reviews so that failures translate into durable improvements.
Strengthen system consistency and author clear, concise documentation for runbooks and operational processes.
Who You Are
4 years of experience in SRE, DevOps, platform engineering, or similar production-facing roles.
Strong problem-solving and debugging skills in distributed systems, with a focus on maintaining high platform stability.
Eager to share operational guidelines, champion SRE practices across teams, and openly discuss what we can learn from system failures.
Excellent communication skills (English is our default language) and a genuinely collaborative approach to working across diverse engineering teams.
Strong hands-on experience with cloud environments (AWS, GCP, or similar) and proficiency with infrastructure-as-code and CI/CD pipelines.
Familiarity with Kubernetes (or container orchestration), event-driven architectures, or supporting ML/AI workloads and GPU infrastructure.
What Success Looks Like:
Within 3 Months:
Fully onboarded into the Rossum ecosystem, with a deep understanding of our infrastructure, observability stack, and SRE processes, while building relationships across the team.
Developing a clear understanding of how Rossum and Coupa work together and of our shared roadmap.
Initial Impact Goal: Resolve a small reliability issue or add value to an existing automation or monitoring area.
Within 6 Months:
Independently managing key responsibilities, owning recurring reliability tasks, and identifying areas for strategic improvement.
Actively participating in the alignment of processes within the new Coupa organizational structure.
Operational KPI: Implement measurable improvements to alert quality, CI/CD reliability, or service health metrics.
Within 12 Months:
Recognized as a subject matter expert within the team, able to navigate the global Coupa ecosystem.
Successfully contributing to Rossum's mission at massive scale, using new global resources.
Long-Term Strategic Goal: Lead a major reliability or infrastructure initiative, providing technical recommendations to guide our long-term reliability strategy.
Cutting-edge AI technology reshaping how businesses operate globally.
A collaborative, supportive environment where autonomy thrives.
Opportunities to grow in a fast-scaling company.
A culture that values diversity, empathy, and genuine connection.
As part of the Coupa family, you'll enjoy the agility of a fast-moving, innovation-focused team with the stability and reach of a global market leader. For you, this means an even greater opportunity to make an impact, access new global markets, and grow your career within a collaborative culture that values autonomy, diversity, and genuine connection. Together, we're not just automating data; we're giving time back to the world's professionals.
What we offer
Future with Coupa: We are currently in an integration phase, during which we are reviewing and aligning our total rewards programs. Our goal is to blend Rossum's local culture with Coupa's global standards to provide you with a long-term future featuring clear career pathways, tailored learning journeys, and world-class development opportunities.
Current Benefits:
Flexible working models, with a base in vibrant Prague and options for a hybrid setup.
Competitive benefits designed to support your well-being, growth, and work-life harmony.
5 weeks of vacation, 5 sick/personal days, and an extra 2 weeks of paternity leave.
Budget for personal development, education, and language courses.
High-end tech (MacBook, external monitor, keyboard of your choice) and a MultiSport card.
Team offsites, regular meetups, and a friendly, ambitious team.
Ready to make an impact in your next role? Apply now!
Required Experience:
Senior IC