Role Overview
We are seeking an experienced and proactive Observability Lead to take ownership of the visibility, reliability and performance monitoring of all production systems across the organisation.
This role is responsible for ensuring that infrastructure, applications, databases and critical services are fully monitored in real time, enabling early issue detection, rapid incident response and continuous service improvement. The ideal candidate will build a strong observability culture by implementing best-in-class monitoring, alerting, logging and performance management practices.
You will work closely with Engineering, DevOps, Security, Product and Support teams to maintain highly available and resilient systems in a fast-paced fintech environment.
Required Skills:
1. Observability Strategy & Ownership
- Develop and lead the company-wide observability strategy across infrastructure, applications, cloud environments, databases and internal services.
- Establish monitoring standards, frameworks and governance for all production workloads.
- Ensure real-time visibility into system health, performance, availability and capacity.
- Build a proactive reliability culture through data-driven monitoring practices.
2. Monitoring & Alerting Management
- Ensure 100% monitoring coverage across all critical production services.
- Design, configure and maintain dashboards, alerts, logs, metrics and distributed tracing systems.
- Continuously optimise alert thresholds to reduce noise and eliminate false positives.
- Maintain centralised monitoring systems accessible to relevant teams.
3. Incident Detection & Operational Response
- Ensure incidents are detected internally before customer impact whenever possible.
- Lead operational response during outages, degradations and system anomalies.
- Coordinate cross-functional teams during incident resolution.
- Drive post-incident reviews, root cause analysis (RCA) and corrective action plans.
4. Performance Monitoring & Optimisation
- Track system latency, throughput, resource utilisation and application performance metrics.
- Identify performance bottlenecks and collaborate with engineering teams on remediation.
- Support load readiness, scaling decisions and capacity planning.
- Improve platform stability and service responsiveness over time.
5. Reporting & Insights
- Produce weekly and monthly reports on system health, uptime, incident trends and risk areas.
- Provide executive dashboards for leadership visibility into platform performance.
- Use operational data to recommend improvements and investment priorities.
6. Collaboration & Leadership
- Partner with Engineering, DevOps, Security and Product teams to embed observability into all deployments.
- Support teams with troubleshooting, diagnostics and production readiness reviews.
- Mentor engineers on monitoring best practices and observability tooling.
- Act as the subject matter expert for reliability, monitoring and operational intelligence.