Description
Assurity Trusted Solutions (ATS) is a wholly owned subsidiary of the Government Technology Agency (GovTech). As a Trusted Partner over the last decade, ATS offers a comprehensive suite of products and services ranging from infrastructure and operational services, authentication services, governance and assurance services, as well as managed ... In a dynamic digital and cyber landscape where trust & collaboration are key, ATS continues to drive mutually beneficial business outcomes through collaboration with GovTech, government agencies, and commercial partners to mitigate cyber risks and bolster security postures.
We are looking for an experienced Observability / Reliability Engineer to lead our monitoring, logging, and observability strategy and implementation across various projects. You will design and implement scalable observability solutions that provide meaningful insights into system performance, enhance operational resilience, and improve user experience. Partnering with engineering, operations, and product teams, as well as external teams, you will drive data-driven improvements that promote effectiveness and efficiency and improve user experience, while reducing manual work and enabling faster incident response in our complex and regulated environment.
You will treat observability as a product: designing, owning, and continuously improving observability capabilities that support a performance-oriented culture.
Responsibilities
- Lead observability practices across services, infrastructure, and applications, ensuring actionable visibility into user-impacting conditions, system health, and system performance.
- Partner with NOC, system administration, and development teams to translate operational challenges into practical observability and automation solutions.
- Build and optimise the observability platform, managing it as a product with roadmap planning and stakeholder engagement.
- Design service-level monitoring (e.g. SLIs, SLOs, synthetic checks, meaningful alert thresholds) to improve early detection of service degradation and outages.
- Develop, maintain, and refine dashboards, telemetry pipelines, and log analysis capabilities that support effective and efficient troubleshooting, incident response, and performance optimisation.
- Identify visibility gaps impacting operational effectiveness and implement telemetry enhancements to close those gaps.
- Improve alert signal quality by reducing noise through deduplication, suppression strategies, maintenance window awareness, and better severity mapping.
- Contribute to automation initiatives that reduce repetitive operational work and support safe and consistent remediation actions.
- Enable teams through training, documentation, and best practices to leverage observability tools and telemetry insights.
- Contribute to service management processes, including incident and problem management and blameless postmortems.
Requirements
- 7 years of experience in observability engineering, SRE, infrastructure monitoring, or related reliability-focused roles, with hands-on implementation exposure.
- Strong understanding of observability fundamentals, including metrics, logs, traces, telemetry correlation, and performance analysis.
- Experience with modern observability tools and stacks such as Prometheus, Grafana, OpenTelemetry, Elastic Stack, Datadog, Splunk, New Relic, or equivalent platforms.
- Practical experience improving alert quality, monitoring strategy, or service visibility in production environments.
- Familiarity with service reliability concepts, including service journeys, SLIs, SLOs, alert thresholds, and incident detection approaches.
- Strong systems thinking and troubleshooting skills, with the ability to translate operational problems into scalable technical solutions.
- Collaborative approach with clear communication across technical and non-technical stakeholders.
- Experience working with external vendors or cross-functional teams.
Nice to Have
- Background in infrastructure engineering, platform operations, or distributed systems environments.
- Experience with incident and problem management, enterprise automation, or regulated environments.
- Knowledge of synthetic monitoring, capacity management, or resilience engineering.
Join us and discover a meaningful and exciting career with Assurity Trusted Solutions!
The remuneration package will be commensurate with your qualifications and experience. Interested applicants, please click Apply Now.
We thank you for your interest. Please note that only shortlisted candidates will be notified.
By submitting your application, you agree that your personal data may be collected, used, and disclosed by Assurity Trusted Solutions Pte. Ltd. (ATS), GovTech, and their service providers and agents in accordance with ATS's privacy statement, which can be found at: or such other successor site.
Benefits
- A wholly-owned subsidiary of GovTech.
- We promote a learning culture and encourage you to grow and learn.
Required Experience:
IC