Senior Site Reliability Engineer (SRE)

Pune - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

The job posting is outdated and position may be filled

Job Summary

About AppDirect

Become a digital global citizen and enable the new generation of digital entrepreneurs around the world. AppDirect offers a subscription commerce platform to sell any product, through any channel, on any device - as a service. We power millions of subscriptions worldwide for organizations. We do this through our values-driven culture - one that enables you to Be Seen, Be Yourself, and Do Your Best Work.

About the DevOps Platform Team

Our mission is to provide a robust Internal Developer Platform to AppDirect's engineering teams, which makes it easy, safe, and fun to design, implement, release, and maintain the world's leading subscription commerce platform. We are proud to be core contributors and maintainers of AppDirect's Software Development Lifecycle (SDLC) through close alignment with Reliability, Quality, Data, InfoSec, Cloud, and other technology leadership.

We enable DevOps culture through our self-service automated CI/CD platform. Currently, teams are leveraging the platform to make more than 3000 code deliveries every month to 700 applications on AWS, Azure, and on-premise environments while remaining ISO27001, SOC2, and PCI compliant. Our Datadog instrumentation allows teams to have clear insights, monitoring, and alerting in order to maintain the availability of their experiences.

What you'll do and how you'll have an impact

Be the founding SRE for India within the DevOps Platform Team, establishing operating rhythms, guardrails, and best practices that raise reliability across hundreds of services and 30 Kubernetes clusters.
Lead global incident management from India time zones: triage and drive resolution as Incident Commander, coordinate war rooms, manage stakeholder communications, and publish timely status page updates.
Maintain automations to enable on-call rotations, escalation policies, and incident workflows in PagerDuty, Datadog, and Slack.
Create actionable runbooks to reduce MTTA/MTTR.
Define and operationalize SLIs/SLOs and error budgets with product and engineering teams; coach teams on using error budgets for release decisions and reliability trade-offs.
Create high-signal observability: instrument services, tune alerts to reduce noise, and build reliability dashboards in Datadog.
Own planned maintenance: plan and schedule maintenance windows, coordinate execution across teams and environments (AWS, Azure, on-prem), communicate broadly, and verify recovery with clear rollback plans.
Eliminate toil through automation: build ChatOps, status page automation, auto-remediation workflows, and runbooks-as-code; integrate incident and maintenance workflows into CI/CD (Jenkins, Argo).
Drive production readiness: define PRR checklists, bake reliability gates into pipelines, and improve deployment strategies (blue/green, progressive delivery).
Partner with DevOps Platform Engineers to harden the Internal Developer Platform and improve developer experience while maintaining compliance requirements (e.g., ISO27001, SOC2, PCI).
Lead blameless postmortems, track corrective actions, and maintain a reliability backlog that measurably improves availability, latency, and change success rate.
Mentor engineers and evangelize SRE principles through documentation, training, and a reliability guild/community of practice.

What we're looking for

4 years in SRE/Production Engineering/DevOps, operating distributed systems and microservices at scale, including Kubernetes and containerized workloads.
Proven incident response leadership: incident triage and coordination, clear stakeholder/customer communications, status page management, and creation of robust runbooks.
Strong observability skills: ideally in Datadog (metrics, logs, traces, dashboards, monitors) or familiarity with Prometheus/Grafana, New Relic, Dynatrace, or similar tools.
Expertise designing actionable alerts tied to SLIs/SLOs and managing error budgets.
Hands-on with CI/CD and release engineering: GitHub Actions, Argo (or similar), progressive delivery, feature flags, and safe rollout/rollback patterns.
Proficiency in at least one programming language (Golang preferred) plus Bash.
Ability to automate incident workflows, status page updates, and remediation tasks via APIs and ChatOps.
Solid foundations in Linux, networking, web protocols, DNS/TLS, load balancers/CDNs, and performance/capacity analysis.
Experience with databases and messaging systems is a plus.
Cloud fluency in Kubernetes, AWS, and/or Azure; understanding of multi-tenant, multi-region, and hybrid/on-prem environments.
Security-minded and comfortable working within compliance frameworks.
Infrastructure as Code experience (Terraform, Ansible, Kubernetes, or similar) and Git-centric workflows.
Excellent written and verbal communication skills. Able to translate technical detail into concise business updates under pressure.
Self-starter, comfortable with ambiguity and a founding-role mindset: high ownership, bias for action, data-driven decision making, and a passion for eliminating toil.
Willingness to participate in on-call during India hours and collaborate with global teams for follow-the-sun coverage.

At AppDirect we believe that innovation thrives in an environment that houses diversity of excellence, experience, and thought. We respect each AppDirector as their own fingerprint: unique, with no one alike. We foster an environment of inclusion without regard to race, religion, age, sexual orientation, or gender identity, enabling AppDirectors to embrace their uniqueness to do their best work. As such, we strongly encourage applications from Indigenous peoples, racialized people, people with disabilities, people from gender and sexually diverse communities, and/or people with intersectional identities.

At AppDirect we take privacy very seriously. For more information about our use and handling of personal data from job applicants, please read our Candidate Privacy Policy. For more information on our general privacy practices, please see the AppDirect Privacy Notice.

Experience: Senior IC
