Site Reliability Engineer

IO TECH SOLUTIONS LIMITED

Job Location:

Hong Kong - Hong Kong

Monthly Salary: Not Disclosed

Posted on: 2 days ago

Vacancies: 1 Vacancy

Job Summary

Position Overview

We are seeking an experienced Support Analyst responsible for the operational ownership of build and shared services, including monitoring, SRE (Site Reliability Engineering), and the stability and performance of critical systems.

Key Responsibilities

Monitor and support SRE operations to ensure the reliability, availability, and performance of production systems.
Build, enhance, and maintain monitoring solutions using:
- ITRS Geneos
- Prometheus
- VictoriaMetrics
- Elasticsearch
- Grafana
Design and maintain alerting rules, dashboards, and observability pipelines.
Troubleshoot Linux servers (RHEL 7/8/9), including:
- upgrades, configuration changes, patching, and maintenance
- assessing monitoring needs for system changes
Perform log analysis and fault finding to identify and resolve performance exceptions.
Collaborate with engineering, application, and infrastructure teams to improve:
- resilience, stability, security, efficiency, and scalability
Participate in on-call rotations, including off-hours and weekend support.
Support Disaster Recovery (DR) and Business Continuity Planning (BCP) drills.
Stay current with modern monitoring/SRE tools and practices and continuously drive improvements.

Requirements

Bachelor's degree in Computer Science / Engineering.
8-10 years of IT experience, preferably within an investment bank or similar environment.
Strong hands-on experience with monitoring and observability platforms, including:
- ITRS Geneos
- Prometheus
- VictoriaMetrics
- Elasticsearch
- Grafana
- Kibana
Hands-on experience building and operating Prometheus pipelines, including:
- exporters
- scraping configurations
- relabeling / metric routing
- integrations with long-term storage (e.g. VictoriaMetrics)
Experience building and maintaining Logstash pipelines, including:
- ingestion, parsing, filtering, enrichment, and routing
- log delivery into Elasticsearch
Ability to design, build, and maintain Grafana and Kibana dashboards for metrics, logs, and performance analytics across distributed systems.
Strong understanding of:
- metrics, logs, alerting, dashboards, and observability pipelines
Strong Linux administration skills (RHEL 7/8/9), including troubleshooting, upgrades, patching, configuration, and performance optimization.
Good understanding of SRE principles, including:
- high availability and scalability
- incident management
- DR / BCP activities
Automation experience is an advantage, e.g. Bash, Python, Ansible, and CI/CD tooling.
Understanding of networking fundamentals, performance tuning, and troubleshooting of distributed systems.
Prior experience in Production Support / SRE / Monitoring Engineering / Shared Services Operations, including participation in on-call rotations (after-hours and weekends).
Self-motivated and adaptable; able to prioritize, learn continuously, and manage multiple responsibilities.
Fluent in English and Chinese.
