The Role
We're building a team that owns production incident response, deep debugging, and permanent fixes across application, data, and deployment layers.
This is not a tickets-only ops role. You will write code, ship fixes safely, and harden the platform so issues don't repeat.
Note: This is an SRE/software engineering role with real production ownership. You'll combine engineering and operations to own outcomes end-to-end: investigate incidents, ship code fixes, and prevent repeat issues through tests, observability, and hardening.
What you'll do
Lead and execute production incident response: triage, mitigation, stakeholder communication, and coordination across teams
Debug and fix issues across Go services (mandatory) and the broader stack ( services where relevant)
Work across service boundaries: GraphQL/RPC, distributed tracing, dependency failures, performance bottlenecks, and safe degradation patterns
Troubleshoot Kubernetes workloads and deployments
Diagnose PostgreSQL/CNPG issues
Handle production bugs that span application and data pipelines (ETL/Snowflake mappings), including backfills/replays and data-quality validation
Build prevention: add regression tests, improve observability, and maintain runbooks/service passports
Drive reliability improvements: SLOs/SLIs, alert quality, release readiness checks, and operational standards across teams
Requirements
What we're looking for (must-have)
- 7 years in SRE / Production Engineering / Platform Engineering (reliability-focused)
- Strong Go (mandatory): ability to read, debug, and ship production fixes in Go codebases
- Proven experience debugging distributed systems in production (latency, error rates, timeouts, retries, cascading failures)
- Strong hands-on experience with Kubernetes in production environments
- Experience with Helm and GitOps workflows (FluxCD preferred; ArgoCD acceptable)
- Solid PostgreSQL troubleshooting experience (performance, incident patterns, migrations)
- Observability experience (metrics/logging/tracing; Datadog/Grafana/Tempo/Loki experience is a plus)
- Strong incident leadership: calm under pressure, clear communication, structured problem-solving
- Engineering hygiene: PR discipline, reviews, testing mindset, safe rollouts/rollbacks
- Comfortable with IAM/security fundamentals in real production systems: OAuth2/OIDC basics, RBAC/least privilege, and safe secrets handling
Nice-to-have
- backend experience in production
- Experience in FinTech / regulated environments / high-availability systems (auditability, change control, incident rigor)
- Data reliability experience: ETL monitoring, reconciliation, Snowflake operations, schema/mapping drift handling
- Reliability patterns common to trading/fintech platforms: correctness and data integrity mindset (idempotency, reconciliation), resilient partner integrations, and strong observability for critical user journeys
Why join
- Build a new function with real impact on reliability and engineering culture
- Work across the full production surface area: application, platform, database, and data pipelines (our whole product)
- High ownership role: you'll influence production standards, tooling, and release safety across teams.
Required Skills:
Primary Skill: SAAS DMS; Secondary Skill: SAAS
Required Education:
Degree: Postgraduate in Computer Science or related field (or equivalent industry experience)