Lead Site Reliability Engineer
Job Summary
Manatal is an HRTech software service (B2B SaaS) company headquartered in Bangkok, Thailand. Manatal is one of the fastest-growing start-ups in the region and is backed by Surge and Sequoia Capital.
Manatal has a global presence and is trusted by thousands of businesses in over 135 countries. Our goal is to transform the entire hiring process by making it simple, efficient, and enjoyable for recruiters, hiring managers, and candidates alike. Our mission is to offer best-in-class AI-powered technologies that empower small, medium, and large businesses in their staffing and recruitment transformation.
Lead Site Reliability Engineer (SRE)
Manatal is establishing a dedicated Site Reliability Engineering function for the first time. As the platform and customer base grow, the need for structured incident response, mature observability, and reliable on-call operations requires dedicated ownership.
The Lead Site Reliability Engineer will be the first dedicated hire in this function. Working closely with the CTO, the Director of Engineering, and the Director of Product, this role owns incident response, on-call operations, and the observability stack. When production incidents occur, this person leads. Between incidents, they improve the systems, alerting, and processes that detect and prevent issues.
This is a senior individual contributor role with significant cross-functional authority. The successful candidate will be responsible for platform reliability and a strong security posture. There are no direct reports initially, though the function may grow over time.
Responsibilities
Incident Response
- Lead all production incidents: triage, coordinate response across engineering teams, communicate status to stakeholders, and drive resolution.
- Own the incident management process end-to-end: detection, escalation, communication, resolution, and post-mortem.
- Write and maintain incident runbooks to enable faster response and reduce dependency on individual knowledge.
- Conduct blameless post-mortems after every significant incident. Track follow-up actions and ensure they are completed.
On-Call Operations
- Own and lead the on-call rotation: define schedules, escalation policies, and response expectations.
- Continuously improve on-call quality by reducing alert noise, eliminating false positives, and ensuring alerts are actionable.
- Analyze on-call patterns to identify recurring issues and drive permanent fixes with engineering teams.
Observability & Monitoring
- Own the observability stack: Datadog (infrastructure monitoring, APM, logging) and Sentry (error monitoring).
- Build and maintain dashboards, alerts, and SLOs that provide meaningful visibility into platform health.
- Ensure application performance monitoring covers critical user flows and API endpoints.
- Identify gaps in observability coverage and systematically close them.
Security Incident Response
- Lead the response to security incidents (data breaches, unauthorized access, credential compromise) in coordination with the Head of Security & Compliance.
- Ensure security-related alerts are properly configured in the monitoring stack and escalation paths are clear.
- Participate in penetration test remediation validation and post-incident security reviews.
Reliability & Process Improvement
- Drive reliability improvements across the platform by identifying systemic weaknesses (single points of failure, missing redundancy, inadequate failover).
- Collaborate with engineering squads to improve deployment practices, rollback procedures, and change management.
- Establish and track SLOs/SLIs for critical services and report on reliability trends.
Requirements
- 5 years of experience in Site Reliability Engineering, DevOps, or Production Engineering at a technology company.
- Proven experience leading production incidents end-to-end: from initial triage through cross-functional coordination to resolution and post-mortem. This is the core of the role.
- Deep hands-on proficiency with Datadog, including infrastructure monitoring, APM, log management, dashboard creation, and alert configuration. This is our primary observability platform, and you will own it.
- Strong working knowledge of Kubernetes in production, specifically AWS EKS. You must be able to troubleshoot pod failures, resource constraints, networking issues, and deployment problems independently.
- Solid experience operating AWS managed services in production environments, including RDS, Amazon MQ, and OpenSearch. You should be comfortable diagnosing performance issues, understanding failover behavior, and tuning configurations.
- Experience with Sentry or equivalent error monitoring platforms for application-level issue detection and triage.
- Proven ability to design and manage on-call rotations, define escalation policies, and build alerting strategies that minimize noise while catching real issues.
- Experience defining and tracking SLOs/SLIs to drive reliability decisions and prioritize engineering effort.
- Excellent communication skills in English, both written and verbal. This role requires clear, composed communication during high-pressure incidents, as well as thorough post-mortem documentation.
- Willingness to be on-call outside of standard working hours. This is a startup environment where the CTO and Director of Engineering are themselves on-call. This role is expected to lead that effort and share the same commitment to platform availability.
Nice to Have
- Experience establishing an SRE function from scratch at a growing company.
- Proficiency in Python, ideally with exposure to Django or FastAPI web application stacks.
- Experience with security incident response processes.
- Background in B2B SaaS or platform companies.
- Experience with CI/CD pipelines and deployment automation (ArgoCD or similar).
Working with Manatal
- This role is based at our Bangkok office in the heart of the city center
- As a team of diverse cultures and nationalities, English language proficiency is required
- We enjoy collaborating in person, and the team works on-site
Why you'll enjoy working with Manatal
- Well-funded start-up backed by Surge & Sequoia
- Be part of one of the fastest-growing B2B SaaS start-ups in the region
- You will be collaborating with talented individuals in the industry
- Join a lively and highly international team
- Indulge in a challenging and innovative working culture
- Attractive medical healthcare plan
- Personal development allowance
- 2 weeks of work from anywhere per year
- Regular fun team-building activities
Hiring Process:
Our hiring process may include some or all of the following rounds:
- Introduction interview with the HR team
- Introduction interview with the CTO and/or Director of Engineering
- Technical interview with technical assessment
- Cultural fit interview with the top management team
Our Tech Stack:
Our engineering team leverages a variety of modern technologies. While working at Manatal, you can expect to be exposed to the following stack:
- Infrastructure: Amazon Web Services; Kubernetes; GitOps (ArgoCD), GitHub Actions, Terraform, among others.
- Backend:
  - Language: Python (Django, FastAPI);
  - Databases: PostgreSQL, MongoDB, Open/ElasticSearch, and Redis, among others;
  - Other: Celery, RabbitMQ, among others.
- Frontend: TypeScript (Vue, React)
Learn more about Manatal:
- Get to know more about us and our product:
- Watch a demo of the Manatal product:
Hiring:
At Manatal, we are dedicated to building an intuitive, AI-powered applicant tracking system to streamline and transform the entire hiring and recruitment process.
We are devoted to attracting the best talent worldwide to join our team and take pride in being an equal-opportunity workplace. We are committed to equal employment opportunities regardless of religion, race, ethnicity, ancestry, national origin, gender, gender identity, sex, sexual orientation, marital or parental status, disability, age, citizenship, or other class protected by applicable law.