Senior Staff Cloud Backend Engineer, Observability and Site Reliability

Coupang

Job Location:

Bengaluru - India

Monthly Salary: Not Disclosed

Posted on: 23 hours ago

Vacancies: 1 Vacancy

Job Summary

Company Introduction

We exist to wow our customers. We know we're doing the right thing when we hear our customers say, "How did we ever live without Coupang?" Born out of an obsession to make shopping, eating, and living easier than ever, we're collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies and have established an unparalleled reputation for being a dominant and reliable force in South Korean commerce.

We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have been since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.

Our mission to build the future of commerce is real. We push the boundaries of what's possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.

Role Overview

As a Senior Staff Data Centre Observability and Site Reliability Engineer, you will design, build, and operate scalable observability and reliability solutions for large-scale datacenter infrastructure. This role focuses on developing high-performance monitoring and telemetry platforms, ensuring system reliability, and driving operational excellence through automation, performance optimization, and SRE best practices. The ideal candidate will work across the full service lifecycle (design, deployment, and continuous improvement) while collaborating with cross-functional teams to enhance visibility, resilience, and efficiency of critical systems.

What You Will Do

Observability and Monitoring

Design, implement, and maintain observability solutions for datacenter infrastructure, including monitoring, logging, alerting, and telemetry systems.
Develop, deploy, and operate large-scale observability and telemetry platforms with a focus on real-time monitoring, high performance, and scalability.
Own and contribute to the full lifecycle of observability services, from design and development to deployment and ongoing optimization.
Build and enhance monitoring systems to ensure high availability, reliability, and performance of infrastructure.
Create and manage dashboards, alerts, and reports to provide clear visibility into system health, performance, and capacity trends.

Site Reliability Engineering (SRE)

Apply SRE principles and best practices to improve reliability, scalability, and operational efficiency of datacenter services.
Develop and maintain automation for infrastructure provisioning, monitoring, and system management.
Lead root cause analysis (RCA) and post-incident reviews, driving corrective actions to prevent recurrence and improve system resilience.

Performance Optimization

Analyze system and application performance across the datacenter infrastructure to identify bottlenecks and improvement areas.
Implement optimization strategies to enhance performance, efficiency, and resource utilization.

Collaboration

Partner with cross-functional engineering teams to understand observability and reliability requirements and deliver effective solutions.
Collaborate with hardware and software vendors to evaluate, integrate, and optimize new technologies within the ecosystem.

Security and Compliance

Ensure observability and reliability solutions adhere to organizational security policies and industry standards.
Implement and maintain appropriate security controls to safeguard infrastructure, systems, and data.

Troubleshooting and Support

Provide hands-on support for observability and reliability issues, including debugging complex hardware and software problems.
Develop and maintain documentation, including troubleshooting guides and operational best practices, to support efficient issue resolution.

Continuous Improvement

Stay current with emerging trends, tools, and technologies in observability and SRE, and incorporate them into the platform.
Continuously enhance the scalability, reliability, and operational efficiency of datacenter services through proactive improvements.

Basic Qualifications

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
12 years of progressive software engineering experience with a heavy emphasis on distributed systems, cloud-native architectures, or platform operations.
Proven experience in managing and optimizing large-scale datacenter environments.
Strong proficiency in Go or Python, with a deep understanding of networked systems and performance optimization.
Expert-level knowledge of Kubernetes internals (scheduling, controllers) and containerization ecosystems.
Proven experience with load balancing, service mesh, and request routing at scale.
Proficiency in observability tools and technologies (e.g., Prometheus, Grafana, ELK Stack).
Experience with SRE practices and tools (e.g., Kubernetes, Docker, Terraform).
Familiarity with cloud platforms (AWS, Azure, GCP) and their observability and reliability services.

Preferred Qualifications

Prior experience building infrastructure specifically for LLM inference or large-scale training clusters.
Familiarity with inference, including mixed precision, kernel tuning, or custom hardware accelerators.
Experience managing hybrid-cloud or multi-AZ deployments across AWS, Azure, or GCP.
Experience operating in regulated environments with strict security and compliance requirements.

Type of work model

Hybrid
Our Hybrid work model: Coupang's hybrid work model is designed to enable a culture of collaboration that acts as a catalyst to enrich the experience of employees. Employees are required to work at least 3 days in the office per week, with the flexibility to work from home 2 days a week, depending on the role requirement. Some businesses may require more time in the office due to the nature of the work.

Details to consider

Those eligible for employment protection (recipients of veterans' benefits, the disabled, etc.) may receive preferential treatment for employment in accordance with applicable laws.

Privacy Notice

Your personal information will be collected and managed by Coupang as stated in the Application Privacy Notice located below.

Experience: Staff IC

About Company

Coupang

Join us to innovate. Rocket your career. Collaborate with teams across the globe. Find your role and learn more about our culture.
