Site Reliability Engineer (4024)
Kuala Lumpur - Malaysia
Job Summary
Enabling safe and rewarding digital lives for genuine people everywhere
We make it our mission to ensure more genuine people have digital access to opportunities and businesses have access to more genuine people. Our technology draws on diverse and reliable data to create a single point of truth for identity and address verification.
With over 30 years of experience behind us, our team and technology are focused on enabling safe and rewarding digital lives for everyone. Regardless of age, location or background, genuine people everywhere should be able to digitally prove who they are and where they live.
About the team and role
Global Fraud Solutions
The team provides decision support solutions to address business objectives in risk prevention and fraud detection. We deliver software solutions and offer client support using our expertise and a client-focused approach.
Site Reliability Engineer
The SRE will build and operate the reliability, observability and operational excellence infrastructure underpinning the GFS managed fraud detection platforms. You will work across deployment pipelines, cloud infrastructure, monitoring and incident management, ensuring GBG can deliver on high availability SLAs for banking and fintech customers who depend on real-time fraud detection at scale.
What you will do
- Design and operate the SRE practice for Managed offerings, including on-call processes, SLA frameworks, incident response playbooks and post-incident review (PIR) processes.
- Build and maintain observability infrastructure: centralised logging (correlation IDs), metrics dashboards, distributed tracing and alerting for the Predator/Instinct platform stack.
- Define and track SLOs (Service Level Objectives) and error budgets for real-time transaction processing pipelines targeting high TPS and low round-trip latency.
- Manage cloud infrastructure provisioning and configuration using IaC tooling (Terraform, Helm), supporting both AWS/Azure cloud deployments and on-premises customer environments.
- Implement and maintain CI/CD pipelines for GFS solutions (Jenkins, etc.).
- Work with Engineering teams to ensure security and compliance readiness for Managed services, including PCI DSS, ISO 27001, SOC 1/2/3 and PDPA/GDPR, in close coordination with InfoSec teams.
- Drive platform resilience improvements: high availability, auto-scaling, disaster recovery, backup/restore procedures and chaos engineering practices.
- Manage secrets, certificate rotation, identity/access controls (OAuth/RBAC) and vulnerability management for the hosted environment.
- Support performance testing methodology and baseline establishment for our products.
- Contribute to the Architecture Review Committee (ARC) with SRE and operational perspectives on technology choices.
- Collaborate with engineering squads to embed reliability and DevSecOps practices across the SDLC.
Skills we're looking for
- Minimum 5 years of solid hands-on experience in a Site Reliability, Platform Engineering or DevOps role, ideally supporting mission-critical, real-time processing systems in banking, payments or fintech.
- Strong proficiency with cloud platforms (AWS preferred; Azure/GCP acceptable), including networking, compute, storage and managed services.
- Deep expertise with containerisation and orchestration: Docker, Kubernetes (EKS/AKS/GKE), Helm and associated tooling.
- Infrastructure as Code experience: Terraform (required) and familiarity with Ansible or Pulumi.
- Observability stack proficiency: Prometheus, Grafana, ELK/OpenSearch, Jaeger/Zipkin or equivalent enterprise-grade tooling.
- CI/CD pipeline design and management: GitHub Actions, Jenkins, ArgoCD or equivalent.
- Experience with security and compliance frameworks applicable to hosted financial services: PCI DSS, ISO 27001, SOC 1/2/3, GDPR/PDPA.
- Familiarity with database reliability practices for SQL Server, PostgreSQL and Oracle, including replication, read replicas and failover.
- Working knowledge of secrets management (HashiCorp Vault, AWS Secrets Manager) and zero-trust identity principles.
- Experience supporting real-time streaming or event-driven architectures (Kafka, RisingWave or similar) in production environments.
- Scripting and automation proficiency: Python, Bash or Go for operational tooling.
- Strong sense of operational ownership and accountability, comfortable being on-call and driving incidents to resolution.
- Excellent communication skills, able to produce clear incident reports, runbooks and architecture documentation for both technical and executive audiences.
- Proactive mindset: identifies reliability risks before they become incidents and champions a culture of blameless post-mortems.
- Collaborative and effective when working with software engineers, product managers and InfoSec teams.
- Continuous improvement orientation, always looking to reduce toil, automate repetitive tasks and improve platform resilience.
- Flexible and adaptable, able to support a globally distributed product with customers across multiple time zones.
To find out more
As an equal opportunity employer, we are dedicated to creating a diverse and inclusive workplace where everyone feels valued and empowered. Please inform your GBG Talent Attraction Partner if you require any reasonable adjustments to the interview process.
To chat to the Talent Attraction team and find out more about our benefits and why we're a great place to work, drop us an email and we'll be in touch. You can also find out more about careers at GBG and check out our current opportunities.
Required Experience:
IC
About Company
We are GBG, global specialists in digital identity. We enable fast, simple and compliant customer onboarding, reducing the risk of fraud for many of the world’s leading organisations. Working with the best data, the best technology and the best people, we make it possible to balance t ...