Iterable Cloud Site Reliability Engineer

Damia Group

Job Location:

Lisbon - Portugal

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Iterable is currently hiring a Cloud Site Reliability Engineer to join their team.

Iterable is a top-rated AI-powered multichannel customer engagement platform that enables organizations to deliver personalized and joyful customer experiences at scale. Trusted by over 1,200 global brands including Airbnb, Grammarly, Wolt, Box, and Calm, Iterable helps companies harmonize cross-channel communications, design dynamic, data-driven campaigns, and optimize engagement throughout the customer lifecycle.

With over 12 years of innovation, Iterable is closing the data activation gap through its intuitive, open architecture and enterprise-grade security. Backed by top-tier investors such as Index Ventures and CRV, the company continues its global expansion with the opening of a cross-functional office in Lisbon. This new hub will focus on innovation, product development, and supporting European growth.

Iterable has been recognized as one of the best places to work by Forbes, Inc., and Wealthfront, with a culture centered on trust, a growth mindset, balance, and humility.

About the role:

As a Senior Engineer on the Cloud Platform team, your impact will be measured by the continuous improvement of our platform's reliability, scalability, and security posture.

SLO Ownership & Error Budget Management: Take direct ownership of the established Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for core platform services (e.g. latency, availability, error rate). You will manage the Error Budget and use it as the primary driver for prioritizing reliability work.
Scale and Harden the Core Platform: Apply deep technical expertise in Kubernetes, AWS, traffic management, and Infrastructure-as-Code to scale and harden the foundational platform that powers Iterable's product.
Systemic Improvements: This role centers on hands-on engineering skill, technical leadership, and systemic reliability improvements within our complex, distributed, multi-region platform.

What you'll do:

Kubernetes Platform Engineering
Use your Kubernetes and AWS expertise to evolve EKS lifecycle, multi-tenant isolation, and regional consistency, ensuring clusters remain secure, performant, and predictable as we scale.
Traffic & Ingress Reliability
Apply advanced knowledge of cloud-native traffic management and API gateways to strengthen routing, authentication, rate-limiting, and secure communication protocols (like mTLS). This focus will dramatically improve both the reliability and security posture of the platform's public and internal service access points.
Infrastructure-as-Code at Scale
Demonstrate mastery in IaC to manage complex, multi-region architecture. Use tools like Terraform Cloud to build reusable modules, validate changes through policy-as-code, and establish safe multi-account patterns our teams can rely on.
Security & Access Control
Drive a zero-trust posture by establishing service guardrails and access controls across the platform. This includes implementing policy-as-code solutions, brokering least-privilege access for the platform using cloud Identity and Access Management (IAM) best practices, and integrating and managing identity providers to define Role-Based Access Control (RBAC) across environments.
Reliability Engineering & Incident Leadership
Demonstrate strong diagnostic and incident-response leadership to rapidly isolate issues across clusters, networks, and workloads. Your ultimate responsibility will be to lead and drive systemic, long-term fixes and root-cause investigations, ensuring all necessary actions are taken to eliminate repeat failures.
Collaboration, Influence & Mentorship
Guide and influence engineering teams across the organization through design reviews, operational best practices, and reliability-focused decision-making.

Requirements:

Core Platform & Infrastructure Expertise
Demonstrate deep skill in managing complex distributed environments at scale, specifically focusing on:
Cloud-Native Orchestration: Expertise in Kubernetes.
Infrastructure Automation: Mastery of Infrastructure-as-Code (IaC), including Terraform.
Advanced Networking & Connectivity: Understanding of core networking fundamentals, including routing, DNS, network segmentation (), and connectivity services (e.g. transit gateways and network endpoints).
Platform Systems: Deep competence in traffic/ingress systems and strong programming fundamentals in Go or Python.
Security & Reliability Skills
Fluency with IAM/IRSA, Vault, mTLS, and least-privilege design, combined with a proven ability to deliver measurable reliability improvements through automation, guardrails, and smart engineering.
Leadership & Communication
Demonstrate a strong operational mindset, excellent technical communication (both written and verbal), and the ability to influence designs, mentor others, and elevate platform engineering practices across teams.

Experience and Proficiency
Demonstrate advanced proficiency and technical leadership in managing large-scale, resilient production systems. This experience is typically gained through roles such as:
-Site Reliability Engineer (SRE)
-Cloud Platform Engineer
-DevOps Engineer
-Other closely related infrastructure role

Their tech stack:

Programming Language: Scala
Databases: Elasticsearch, Postgres, Redis, CRDB
Infrastructure: Pulsar, Kafka, AWS
Other Relevant Technologies: Docker / Kubernetes, React

Perks & Benefits:

Competitive salaries & meaningful equity
Private Medical Insurance
Life/Risk Assurance
Meal Allowance: 8.55 per day
Balance Days (additional paid holidays)
Paid Annual Leave (22 days)
Public Holidays (14 days)
Paid Sabbatical
Complete laptop workstation

Want to know more? Get in touch with us.
