Staff Site Reliability Engineer (8+ years)

Bangalore - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

Job Summary

Expert-level proficiency operating large-scale, distributed, mission-critical systems: designing for high availability, multi-region resiliency, low latency, and predictable performance under extreme load.
SRE fundamentals at Staff level: defines and drives SLOs/SLIs, error budgets, availability targets, and capacity guardrails; codifies reliability requirements into design reviews and change-management gates.
Deep hands-on experience with Kubernetes and container platforms: multi-cluster operations, workload placement, HPA/VPA, pod disruption budgets, network policies, admission control, service mesh (Istio/Linkerd), and progressive delivery (blue/green, canary, feature flags).
Infra as Code and GitOps: Terraform (and/or Pulumi), Helm/Kustomize, Argo CD/Flux; builds reusable modules, policy-as-code (OPA/Conftest), environment drift detection, and automated remediation.
Observability at scale: OpenTelemetry instrumentation/tracing, metrics (Prometheus), logging (ELK/OpenSearch), distributed tracing (Jaeger/Tempo/Zipkin), dashboards and SLO burn-rate alerts (Grafana); designs actionable alerts with runbook automation.
Proven incident leadership: serves as Incident Commander for P0/P1 events, coordinates cross-functional response, stabilizes systems, restores service quickly, and drives blameless postmortems with measurable follow-through.
Performance engineering and capacity planning: load and resilience testing, GC/heap and thread tuning (for JVM services), profiling (CPU, memory, IO), caching strategies, queue backpressure, and cost-aware capacity models.
Strong systems and networking: Linux internals, filesystems, TCP/UDP, TLS/mTLS, HTTP/2/3, DNS, BGP/Anycast concepts, L4-L7 load balancing (Envoy/HAProxy/NGINX), CDN/edge (Cloudflare/Fastly/Akamai), WAF, and DDoS mitigation.
Data/store reliability: operational experience with relational (PostgreSQL/MySQL/Oracle) and NoSQL (Cassandra/DynamoDB/MongoDB) databases, streaming platforms (Kafka/Pulsar/Kinesis), and distributed caches (Redis/Hazelcast); backup/restore, consistency models, compaction/retention tuning, and multi-AZ/region failover.
Cloud and platform engineering: AWS/Azure/GCP core services, VPC design, IAM/RBAC, KMS, secrets management (Vault), service catalog, golden images/base containers, and paved-road platforms for developers.
Release engineering and CI/CD: Jenkins/GitHub Actions/GitLab CI, artifact/signing/SBOM, canary analysis, automated rollbacks, deployment safety checks, and change failure rate/MTTR improvements.
Reliability-by-design partnership: participates in and leads architecture/design reviews, threat modeling, and resilience patterns (bulkheads, circuit breakers, idempotency, retry/backoff, dead-letter handling).
Disaster recovery and business continuity: RTO/RPO objectives, runbooks, game days/chaos experiments (Litmus/Gremlin), regional evacuation, and active-active/active-passive strategies.
Security in depth for production systems: least privilege, workload identity, image and dependency scanning, supply-chain hardening (SLSA), SBOM, network segmentation/zero trust, and PCI-DSS-aligned operational controls.
Strong programming and automation: production-grade Go and/or Python (plus Bash); contributing SRE tooling, controllers/operators, and APIs; code reviews, testing, and docs-as-code.
Effective communicator and influencer: aligns reliability strategy with business outcomes, mentors engineers, challenges assumptions with data, and proposes pragmatic, incremental improvements.
Experience leveraging GenAI/LLMs as copilots: accelerating runbook authoring, alert triage, knowledge retrieval, and post-incident synthesis with appropriate guardrails and data security.
Nice to have: JVM and runtime tuning experience, traffic engineering at Internet scale, mobile edge/network reliability considerations.

This is a hybrid position. Expectation of days in office will be confirmed by your hiring manager.

Qualifications:

Basic Qualifications
5 years of relevant work experience with a Bachelor's degree, or at least 2 years of work experience with an Advanced degree (e.g. Master's, MBA, JD, MD), or 0 years of work experience with a PhD, OR 8 years of relevant work experience.

Preferred Qualifications
5 years of relevant work experience with a Bachelor's degree, or at least 2 years of work experience with an Advanced degree (e.g. Master's, MBA, JD, MD), or 0 years of work experience with a PhD, OR 8 years of relevant work experience.
Demonstrated ownership of SLOs/error budgets and production change risk management for tier-1 services.
Production experience with Kubernetes at scale, service mesh, and at least one major cloud provider (AWS/Azure/GCP).
Proficiency with Terraform and GitOps workflows; strong coding skills in Go and/or Python.
Hands-on with observability stacks (OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch, one commercial APM/log platform).
Track record as Incident Commander and author of high-quality postmortems that drove systemic fixes.
Experience with streaming platforms (Kafka/Pulsar), distributed datastores (Cassandra/DynamoDB), and caching (Redis).
Familiarity with PCI-DSS or similarly stringent compliance environments.
Excellent communication, stakeholder management, and mentoring abilities.

Additional Information:

Visa is an EEO Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, or protected veteran status. Visa will also consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law.

Employment Type: Full-time

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root Cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

Visa

Visa (NYSE: V) is a world leader in digital payments, facilitating transactions between consumers, merchants, financial institutions and government entities across more than 200 countries and territories. Our purpose is to uplift everyone, everywhere by being the best way to pay and b ...