Site Reliability Engineer II

Bengaluru - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

Job Summary

About Backblaze

Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets, unburden administrators, and unleash innovators. Together with our partners, we're helping customers break free from the restrictive, overpriced legacy solutions that hold them back and blaze forward with the full power of the open cloud in their hands.

Founded in 2007, we scaled the business with less than $3 million in outside funding until 2021, when we did a traditional IPO on the Nasdaq stock exchange. Today, Backblaze generates over $100 million in revenue and is the leading specialized storage cloud, managing over three billion gigabytes of data storage for 500K customers in 175 countries, including businesses, developers, IT professionals, and individuals.

About the Role

We are seeking a Site Reliability Engineer II (SRE II) to help ensure the stability, scalability, and reliability of our services and infrastructure. This role focuses on building automation, maintaining observability, and supporting incident response to keep customer-facing systems performing at their best. The SRE will collaborate with engineering, product, and operations teams to embed reliability practices into day-to-day development and operations while contributing to tools and processes that improve efficiency and reduce manual effort.

Key Responsibilities

Service Reliability & Operations

Support the availability and durability of critical services across production environments.
Monitor service health using SLIs, SLOs, and error budgets, and escalate issues when thresholds are at risk.
Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
Follow established ITIL/OSS processes (incident, change, problem, and capacity management).

Automation & Tooling

Develop automation for common operational tasks, reducing manual intervention and toil.
Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
Work with CI/CD pipelines, configuration management, and infrastructure-as-code tools (Terraform, Ansible, Jenkins).
Write scripts (Bash, Python, Go, etc.) to improve system reliability and efficiency.

Collaboration

Partner with engineering, product, and operations teams to support resilient system design and operations.
Assist in capacity planning and disaster recovery exercises.
Work with vendors and service providers to troubleshoot service issues and track SLA performance.
Document systems, share learnings, and help grow a reliability-minded engineering culture.

Continuous Improvement

Contribute to playbooks, runbooks, and operational documentation.
Identify recurring issues and propose long-term improvements.
Promote reliability-focused practices within development and operations teams.

Qualifications

Education & Experience

Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience).
2-4 years of experience in site reliability, systems engineering, or operations.
Exposure to large-scale production-grade systems.

Technical Skills

Solid Linux systems administration and troubleshooting skills.
Familiarity with service reliability concepts: monitoring, alerting, incident response, and root cause analysis.
Proficiency in at least one scripting language (Python, Bash, or Go).
Understanding of containers (Kubernetes, Docker) and microservices concepts.
Knowledge of incident response and operational best practices.

Preferred Attributes

Experience in a SaaS, service provider, or distributed systems environment.
Familiarity with ITIL/OSS practices and SLOs/SLAs.
Strong problem-solving skills and willingness to learn new technologies.
Experience with cloud platforms (AWS, GCP, or Azure).
Ability to work independently, take ownership, and drive projects from problem discovery through resolution.

At this point, we hope you're feeling excited about the job description you're reading. Even if you don't meet every requirement, we still encourage you to apply. Learning, developing, and growing are key parts of our culture. We're eager to meet people who believe in our mission and can contribute to our team in various ways. We want people to feel comfortable expressing their true selves and to come, stay, and do their best work here.

At Backblaze, we value being fair and good to our customers, partners, and employees. That's why diversity, equity, and inclusion are at the core of our values. We are committed to fostering a workforce where all employees feel a sense of belonging regardless of race, ethnicity, nationality, gender, sexual orientation, age, religion, socio-economic status, ability, veteran status, and education. We believe that our dedication to cultivating a diverse workspace not only allows us to better serve our customers in over 175 countries but further reinforces our commitment to doing the right thing. We are proud to be an Equal Opportunity Employer.

To understand more about the data we collect and process as part of your application, please view our Backblaze Employee Privacy Notice.

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root Cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

Backblaze is a pioneer in robust, scalable, low-cost cloud backup and storage services, including enterprise hot storage, low-cost backup and archive, and more.
