Lead Senior Reliability Engineer

Manatal Co LTD

Job Location:

Bangkok - Thailand

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Manatal is an HRTech software service (B2B SaaS) company headquartered in Bangkok, Thailand. Manatal is one of the fastest-growing start-ups in the region and is backed by Surge and Sequoia Capital.

Manatal has a global presence and is trusted by thousands of businesses in over 135 countries. Our goal is to transform the entire hiring process by making it simple, efficient, and enjoyable for recruiters, hiring managers, and candidates alike. Our mission is to offer best-in-class AI-powered technologies to empower small, medium, and large businesses in their staffing & recruitment transformation.

Lead Site Reliability Engineer (SRE)

Manatal is establishing a dedicated Site Reliability Engineering function for the first time. As the platform and customer base grow, the need for structured incident response, mature observability, and reliable on-call operations requires dedicated ownership.

The Lead Site Reliability Engineer will be the first dedicated hire in this function. Working closely with the CTO, the Director of Engineering, and the Director of Product, this role owns incident response, on-call operations, and the observability stack. When production incidents occur, this person leads. Between incidents, they improve the systems, alerting, and processes that detect and prevent issues.

This is a senior individual contributor role with significant cross-functional authority. The successful candidate will be responsible for platform reliability and for maintaining a strong security posture. There are no direct reports initially, though the function may grow over time.

Responsibilities

Incident Response

Lead all production incidents: triage, coordinate response across engineering teams, communicate status to stakeholders, and drive resolution.
Own the incident management process end-to-end: detection, escalation, communication, resolution, and post-mortem.
Write and maintain incident runbooks to enable faster response and reduce dependency on individual knowledge.
Conduct blameless post-mortems after every significant incident. Track follow-up actions and ensure they are completed.

On-Call Operations

Own and lead the on-call rotation: define schedules, escalation policies, and response expectations.
Continuously improve on-call quality by reducing alert noise, eliminating false positives, and ensuring alerts are actionable.
Analyze on-call patterns to identify recurring issues and drive permanent fixes with engineering teams.

Observability & Monitoring

Own the observability stack: Datadog (infrastructure monitoring, APM, logging) and Sentry (error monitoring).
Build and maintain dashboards, alerts, and SLOs that provide meaningful visibility into platform health.
Ensure application performance monitoring covers critical user flows and API endpoints.
Identify gaps in observability coverage and systematically close them.

Security Incident Response

Lead the response to security incidents (data breaches, unauthorized access, credential compromise) in coordination with the Head of Security & Compliance.
Ensure security-related alerts are properly configured in the monitoring stack and escalation paths are clear.
Participate in penetration test remediation validation and post-incident security reviews.

Reliability & Process Improvement

Drive reliability improvements across the platform by identifying systemic weaknesses (single points of failure, missing redundancy, inadequate failover).
Collaborate with engineering squads to improve deployment practices, rollback procedures, and change management.
Establish and track SLOs/SLIs for critical services and report on reliability trends.

Requirements

5 years of experience in Site Reliability Engineering, DevOps, or Production Engineering at a technology company.
Proven experience leading production incidents end-to-end: from initial triage through cross-functional coordination to resolution and post-mortem. This is the core of the role.
Deep hands-on proficiency with Datadog, including infrastructure monitoring, APM, log management, dashboard creation, and alert configuration. This is our primary observability platform, and you will own it.
Strong working knowledge of Kubernetes in production, specifically AWS EKS. You must be able to troubleshoot pod failures, resource constraints, networking issues, and deployment problems independently.
Solid experience operating AWS managed services in production environments, including RDS, Amazon MQ, and OpenSearch. You should be comfortable diagnosing performance issues, understanding failover behavior, and tuning configurations.
Experience with Sentry or equivalent error monitoring platforms for application-level issue detection and triage.
Proven ability to design and manage on-call rotations, define escalation policies, and build alerting strategies that minimize noise while catching real issues.
Experience defining and tracking SLOs/SLIs to drive reliability decisions and prioritize engineering effort.
Excellent communication skills in English, both written and verbal. This role requires clear, composed communication during high-pressure incidents as well as thorough post-mortem documentation.
Willingness to be on-call outside of standard working hours. This is a startup environment where the CTO and Director of Engineering are themselves on-call. This role is expected to lead that effort and share the same commitment to platform availability.

Nice to Have

Experience establishing an SRE function from scratch at a growing company.
Proficiency in Python, ideally with exposure to Django or FastAPI web application stacks.
Experience with security incident response processes.
Background in B2B SaaS or platform companies.
Experience with CI/CD pipelines and deployment automation (ArgoCD or similar).

Working with Manatal

This role is based at our Bangkok office in the heart of the city center
We are a team of diverse cultures and nationalities, so English language proficiency is required
We enjoy collaborating in person, and the team works on-site

Why you'll enjoy working with Manatal

Well-funded start-up backed by Surge & Sequoia
Be part of one of the fastest-growing B2B SaaS start-ups in the region
You will be collaborating with talented individuals in the industry
Join a lively and highly international team
Indulge in a challenging and innovative working culture
Attractive medical healthcare plan
Personal development allowance
2 weeks of work from anywhere per year
Regular fun team-building activities

Hiring Process:

Our hiring process may include some or all of the following rounds:

Introduction interview with the HR team
Introduction interview with CTO and/or Director of Engineering
Technical Interview with technical assessment
Cultural fit interview with top management team

Our Tech Stack:

Our engineering team leverages a variety of modern technologies. While working at Manatal, you can expect to be exposed to the following stack:

Infrastructure: Amazon Web Services, Kubernetes, GitOps (ArgoCD), GitHub Actions, and Terraform, among others.
Backend:
- Language: Python (Django, FastAPI);
- Databases: PostgreSQL, MongoDB, Open/ElasticSearch, and Redis, among others;
- Other: Celery and RabbitMQ, among others.
Frontend: TypeScript (Vue, React)

Learn more about Manatal:

Get to know more about us and our product:
Watch a product demo of the Manatal product:
Hiring: At Manatal, we are dedicated to building an intuitive AI-powered applicant tracking system to streamline and transform the entire hiring and recruitment process.
We are devoted to attracting the best talent worldwide to join our team and take pride in being an equal-opportunity workplace. We are committed to equal employment opportunities regardless of religion, race, ethnicity, ancestry, national origin, gender, gender identity, sex, sexual orientation, marital or parental status, disability, age, citizenship, or other class protected by applicable law.

