Site Reliability Engineer (SRE) (GovTech)

AvePoint

Posted on : 23-08-2025

1 Vacancy

Job Location

Singapore - Singapore

Monthly Salary

Not Disclosed

Job Description

We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.

As a Site Reliability Engineer, you will be responsible for designing and operating the GitLab, AWS, and Kubernetes-based infrastructure and solutions that power our platform, ensuring its stability, scalability, and performance.

Responsibilities:

As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
Participate in on-call rotations, respond promptly to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.

Maintenance, Optimisation & Performance
Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g. MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
Act as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments, and participate in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.

Requirements:

Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
Proven experience as a Site Reliability Engineer or in a similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
Strong documentation skills and experience in knowledge sharing across teams.
Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
Experience integrating Kubernetes with AWS cloud technologies (e.g. Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.

Soft skills:

Proactive in identifying problems and recommending strategic solutions.
Excellent problem-solving skills with a robust analytical mindset.
Clear, concise, and effective communication skills; adept at collaborating with cross-functional teams, including development, security, and customer-facing groups.
Ability to remain calm and effective under pressure, especially during incident response.
Adaptability to rapid change, with a continuous learning mindset and a willingness to share knowledge to foster team growth.
Customer-focused, with the ability to translate technical insights into understandable, actionable guidance.
Leadership and mentoring capabilities that contribute to a resilient and collaborative team environment are a plus.

Employment Type

Full Time
