Staff Site Reliability Engineer

Arcadia

Job Location:

Chennai - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Who we are:

Arcadia is the AI-powered energy intelligence platform for businesses. We replace fragmented tools and manual workflows with one platform to pay utility bills, buy energy, and advance sustainability across every location at enterprise scale.

Trusted by Fortune 2000 companies, Arcadia combines unified data, AI-powered analytics, and expert advisory to help enterprise teams save money, mitigate risk, and cut carbon.

We deliver this through three comprehensive solutions:

Utility Bill Management: Automating the entire utility bill lifecycle, from data capture and validation to payment processing and auditing.
Energy Procurement Advisory: Bringing together comprehensive data, AI-powered analytics, market expertise, and a strong partner network to make sophisticated procurement options accessible to all.
Sustainability Reporting: Verified emissions data with seamless integration into leading sustainability platforms.

Tackling the world's most complex energy challenges requires diverse thinking. We're building teams of people from different backgrounds, industries, and disciplines, united by a belief that energy management should be simple, intelligent, and a genuine driver of business value.

What we're looking for:

We are seeking a Staff Site Reliability Engineer (L4) to join our SRE/Platform Engineering team in India. This is a senior technical leadership role: not people management, but engineering leadership through execution, mentorship, and architectural ownership.

Our India SRE team is growing, and this role is central to that growth. As we scale, we need a technical anchor in the India timezone who can independently own multi-week SRE projects from problem statement to production, make sound architectural decisions under ambiguity, and elevate the team around them. You will be the person engineers lean on for design reviews, debugging escalations, and "how should we approach this?" conversations. You'll bring the depth and experience to drive execution autonomously in the India timezone while collaborating closely with US-based SRE leadership on roadmap priorities, incident response, and platform strategy.

This is a role for someone who doesn't wait for direction: you identify reliability gaps, propose solutions, build consensus, and ship.

Our infrastructure is primarily AWS-based, managed by Terraform and CloudFormation, and deployed using CI/CD best practices. In your application, please include a link to GitHub or another place where your code is published, though we understand that not everyone has public code online.

What you'll do:

Own and deliver SRE projects end-to-end, from scoping and design through implementation, testing, rollout, and documentation
Serve as a technical anchor for the India SRE team: conduct design reviews, pair on complex debugging, and mentor engineers to develop the judgment to work through ambiguous problems independently
Design and implement infrastructure solutions across AWS (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, S3, CloudFront, Lambda, SQS) using Terraform and CloudFormation, with an emphasis on making the right tradeoffs between speed, reliability, and cost
Lead Kubernetes operations, including cluster upgrades, capacity planning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments, and build the runbooks and automation so these become repeatable rather than one-off heroics
Evolve CI/CD pipelines across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD, with an emphasis on reducing manual deployment steps and improving rollback safety
Drive observability stack enhancements: deliver the infrastructure and architectural direction necessary for engineering teams to leverage Prometheus, Grafana, and CloudWatch effectively
Identify and execute FinOps initiatives: find zombie resources, right-size instances, enforce tagging standards, and present cost-reduction recommendations with data to back them up
Manage database reliability across MySQL and PostgreSQL, including backup validation, performance tuning, replication health, failover testing, and operational runbooks
Strengthen security posture through IAM least-privilege enforcement, CSPM reviews, GuardDuty/CloudTrail monitoring, secrets management (Vault, AWS Secrets Manager, Parameter Store), and audit readiness
Troubleshoot complex, cross-cutting production issues spanning networking, Kubernetes, compute, databases, and CI/CD, and then turn the fix into a runbook or automation so the same issue doesn't require the same person next time
Write the documentation the team actually needs: architecture decision records, operational runbooks, troubleshooting guides, and post-incident action items that get closed, not just filed
Collaborate daily with US-based SRE leadership on incident reviews, migration planning, roadmap execution, and platform strategy, bringing context and recommendations, not just status updates
Participate in on-call rotations and drive post-incident analysis with a focus on systemic fixes over individual blame

What will help you succeed:

Must-haves:

8 to 14 years of experience in SRE/DevOps/Cloud Engineering with a demonstrated progression from task execution to project ownership. We're looking for evidence that you have independently scoped, designed, and delivered infrastructure projects end-to-end
Deep hands-on expertise with AWS: EKS, IAM, RDS, EC2, VPC, CloudWatch, CloudTrail, GuardDuty, Lambda, SQS. You should be able to architect a multi-AZ, multi-account solution and explain why you made the choices you made
Strong Terraform skills, with experience managing complex multi-environment state, writing reusable modules, and reviewing others' IaC for correctness and maintainability
Advanced Kubernetes knowledge: you don't just deploy to K8s; you troubleshoot networking issues at the CNI level, tune resource requests and limits based on actual usage data, and can plan and execute cluster upgrades with minimal downtime
CI/CD pipeline design and ownership across Jenkins (Groovy), GitHub Actions, ArgoCD, or FluxCD, with a track record of improving deployment reliability and reducing manual steps
Observability stack experience with Prometheus, Grafana, Datadog, or equivalent, including defining SLOs/SLIs, building meaningful dashboards, and tuning alerting to reduce noise
Proven mentorship ability: you have helped less experienced engineers grow. This could be formal (tech lead role, code review ownership) or informal (the person everyone goes to when they're stuck). We will ask you about this in interviews
Strong written and verbal communication skills: you will interact with US-based teams daily, present proposals asynchronously, and write documentation that others can actually follow
Automation-first mindset: your instinct when you do something manually is to immediately think about how to script it. You have a track record of reducing operational toil through scripting and tooling
Incident management experience: you have led or significantly contributed to incident response and post-incident reviews in production environments, and you understand the difference between fixing the symptom and fixing the system
Ability to operate with autonomy: you don't need daily direction. Given a problem space and constraints, you can propose an approach, pressure-test it with peers, and execute

Nice-to-haves:

Experience with FinOps practices: cloud cost analysis, rightsizing, tagging governance, reserved instance planning
Exposure to secrets management platforms (HashiCorp Vault, AWS Secrets Manager)
Experience with event-driven architectures using AWS Lambda, CloudWatch Events, SQS, and SNS
Exposure to AI-enabled tooling (automation assistants, MCP, RAG pipelines, LLM-based debugging)
Experience with data warehouses (Snowflake) and their operational requirements
Experience with n8n or similar workflow automation platforms
Industry certifications: AWS Solutions Architect Professional, CNCF CKA/CKS, HashiCorp Terraform Associate, or equivalent
Experience working in a company that has grown through acquisitions, with exposure to consolidating disparate infrastructure environments

Benefits:

Competitive compensation based on market standards
We are working on a hybrid model with a remote-first policy
Apart from fixed base salary, candidates are eligible for the following benefits:

Flexible Leave Policy
Office located in the heart of the city in case you need to step in for any purpose

We provide comprehensive coverage including accident policy and life insurance.

Medical Insurance (15 Family Members)
Flexible Benefit Plan
Awards and Bonus
Annual performance cycle
Quarterly engagement activities

A supportive engineering culture that values diversity, empathy, teamwork, trust, and efficiency

Eliminating carbon footprints, eliminating carbon copies.

Here at Arcadia, we cultivate diversity, celebrate individuality, and believe unique perspectives are key to our collective success in creating a clean energy future. Arcadia is committed to equal employment opportunities regardless of race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, disability, genetic information, protected veteran status, or any status protected by applicable federal, state, or local law.

Thank you

Required Experience:

Staff IC

About Company

Arcadia

Renters and homeowners, connect to a local solar farm for no extra cost and get savings on your power bill. Two minutes is all you need to subscribe.