SRE with AWS, Elasticsearch, Kubernetes, and Grafana (10 years of experience), Bangalore.
All 5 days in office
Location: Koramangala, Bangalore
25 LPA
Job Description
We are seeking a highly experienced Site Reliability Engineer (SRE) with 10 years of experience in designing, implementing, and maintaining highly available, scalable, and resilient systems. The ideal candidate will have deep expertise in AWS, Kubernetes, Elasticsearch, Grafana, and modern SRE practices, with a strong focus on automation, observability, and operational excellence.
Key Responsibilities
- Design, build, and operate highly reliable, scalable, and fault-tolerant systems in AWS cloud environments.
- Implement and manage Kubernetes (EKS) clusters, including deployment strategies, scaling, upgrades, and security hardening.
- Own and improve SLIs, SLOs, and SLAs, driving reliability through data-driven decisions.
- Architect and maintain observability platforms using Grafana, Prometheus, and Elasticsearch.
- Manage and optimize Elasticsearch clusters, including indexing strategies, performance tuning, scaling, and backup/restore.
- Develop and maintain monitoring, alerting, and logging solutions to ensure proactive incident detection and response.
- Lead incident management, root cause analysis (RCA), postmortems, and continuous improvement initiatives.
- Automate infrastructure and operations using Infrastructure as Code (IaC) and scripting.
- Collaborate with development teams to improve system reliability, deployment pipelines, and release processes.
- Implement CI/CD best practices and reduce deployment risk through canary, blue-green, and rolling deployments.
- Ensure security, compliance, and cost optimization across cloud infrastructure.
- Mentor junior SREs and drive adoption of SRE best practices across teams.
Required Skills & Qualifications
Core Technical Skills
- 10 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
- Strong hands-on experience with AWS services (EC2, EKS, S3, RDS, IAM, VPC, CloudWatch, Auto Scaling).
- Advanced expertise in Kubernetes (EKS preferred), Helm, and container orchestration.
- Deep knowledge of Elasticsearch (cluster management, indexing, search optimization, performance tuning).
- Strong experience with Grafana and observability stacks (Prometheus, Loki, ELK).
- Proficiency in Linux system administration and networking fundamentals.
- Experience with Infrastructure as Code tools (Terraform, CloudFormation).
- Strong scripting skills in Python, Bash, or Go.