Site Reliability Engineer - Sr. Consultant Level - Linux, Observability, Containers

Visa

Posted on : 26-06-2025

1 Vacancy

Job Location

Bangalore - India

Monthly Salary

Not Disclosed

Job Description

We are seeking an accomplished Site Reliability Engineer (SRE) Sr. Consultant to join our dynamic Observability team. In this senior role, you will provide technical leadership in developing and maintaining reliable, secure, and cost-effective observability solutions that support our global operations.

As the Sr. Consultant SRE, you will serve as the strategic bridge between development and operations, ensuring all systems and services are efficient, highly available, resilient, and scalable. You will collaborate closely with software engineers, system administrators, and cross-functional stakeholders to drive automation, optimize performance, and enable seamless application delivery.

You will take end-to-end ownership of critical observability initiatives with a strong focus on availability, performance, security, and reliability. You will lead the design and implementation of robust monitoring, alerting, and automation frameworks to minimize incidents and accelerate incident resolution. Your leadership will be instrumental in guiding and mentoring the team, ensuring best practices are consistently adopted and operational excellence is maintained.

Key responsibilities include driving continuous improvement across processes, tools, and technologies; leading root cause analysis; and developing preventive measures for production incidents. You will champion a culture of collaboration, innovation, and proactive problem-solving, supporting engineering teams with the technical expertise needed to meet demanding requirements.

As an integral member and leader within our Agile Scrum teams, your technical acumen, leadership skills, and ability to mentor others will be central to delivering impactful, high-quality results.

Responsibilities

Lead SRE and DevOps operations during APAC hours, ensuring alignment with project objectives, delivery timelines, SLAs, and OLAs.
Act as the primary escalation point for complex technical issues and incidents, driving resolution and communicating status to leadership and stakeholders.
Provide strategic input and recommendations on SRE and DevOps initiatives to management, supporting roadmap planning and resource allocation.
Coordinate and manage relationships with multiple stakeholders, both internal and external, across various technology domains.
Analyze production defects, perform in-depth root cause analysis across code, data, and infrastructure, and champion the implementation of long-term preventative solutions.
Mentor, guide, and inspire team members through technical leadership, code reviews, pairing, and ongoing knowledge sharing.
Lead security and compliance efforts by ensuring timely application of security patches and hotfixes and adherence to cybersecurity best practices.
Oversee the design, deployment, and continuous improvement of monitoring, alerting, and logging instrumentation, ensuring comprehensive observability.
Architect and drive the development of automation frameworks to optimize operational efficiency, eliminate manual toil, and streamline system integration.
Manage and support observability platforms, including Splunk, ClickHouse, Grafana, Prometheus, M3DB, OpenTelemetry, Fluent Bit, Elasticsearch, OpenSearch, and CloudWatch.
Collaborate with development and product teams to design and implement scalable monitoring solutions and support the creation of reliable environments across the SDLC.
Promote and enforce DevOps and SRE best practices, fostering a culture of automation, reliability, and continuous improvement across the organization.
Design, implement, and maintain robust CI/CD pipelines, enabling rapid, reliable, and automated software delivery.
Administer, optimize, and scale cloud infrastructure (AWS, GCP) to ensure high availability, performance, and security.
Lead the adoption and management of infrastructure as code practices using tools such as Terraform, Ansible, or CloudFormation.
Continuously monitor and analyze system health, proactively identifying and mitigating risks to reliability and performance.
Oversee deployment and management of containerization and orchestration solutions (Docker, Kubernetes) for modern application delivery.
Drive incident management processes, including leading post-incident reviews, facilitating blameless postmortems, and implementing actionable improvements.
Create, maintain, and improve detailed documentation for infrastructure, processes, runbooks, and standard operating procedures.
Provide advanced technical support and troubleshooting, guiding team members through complex infrastructure and deployment issues.
Identify, propose, and implement opportunities for process, tooling, and workflow automation to drive operational excellence.
Lead disaster recovery planning, capacity management, and business continuity initiatives in collaboration with cross-functional teams.
Evaluate, recommend, and drive the adoption of new technologies, tools, and practices that enhance reliability, scalability, and observability.
Present technical strategies, incident findings, and project updates to executive leadership and cross-functional stakeholders.
Foster an inclusive and collaborative team environment, supporting professional growth and the continuous development of SRE best practices.

Visa's Observability ecosystem includes over 2,000 platform nodes utilizing approximately 15 different tools for logging, monitoring, and tracing, alongside 80,000 client agents. The system handles daily log ingestion exceeding 100 TB and oversees hundreds of critical applications, supporting vital alerts, dashboards, and reports. To maintain this high level of performance and reliability, we need a Site Reliability Engineer Sr. Consultant with comprehensive knowledge and practical experience. This position requires an I6.5-level engineer who can operate independently with minimal supervision.

About Visa's PRE Observability Team

Visa's Product Reliability Engineering (PRE) Observability team partners with Product Development as well as Operations & Infrastructure teams to build and manage innovative, reliable, scalable, secure, and cost-effective observability platform solutions. We are looking for talented Senior Site Reliability Engineers to join our driven team, with a focus on maximizing system availability, performance, security, and reliability. This dynamic role requires technical leadership, strong problem-solving skills, and expertise in coding, testing, and debugging.

This is a hybrid position. Expectation of days in office will be confirmed by your hiring manager.

Qualifications:

Basic Qualifications:

Bachelor's degree with 10-14 years of relevant professional experience

Preferred Qualifications:

Extensive hands-on experience with observability tools such as Splunk, ClickHouse, Grafana, Prometheus, M3DB, OpenTelemetry, Fluent Bit, Elasticsearch, OpenSearch, and CloudWatch.
Proven ability to set up and manage exporters (e.g., Node Exporter, Cert Exporter, and others) for metrics collection.
Deep experience with containerization and orchestration platforms, including Docker and Kubernetes.
Strong background in CI/CD pipeline management using tools such as GitHub and Ansible.
Proficiency with Infrastructure as Code (IaC) technologies such as Terraform and configuration management practices like GitOps.
Advanced scripting skills in Python and/or Shell within Linux environments; experience with Unix scripting.
Working knowledge of query languages such as PromQL, MS SQL, or Splunk SPL is highly desirable.
Cloud certifications in AWS or GCP are a significant advantage.
Demonstrated ability to analyze complex technical problems and solutions and to communicate effectively at the appropriate level of detail with both technical and non-technical stakeholders.
Exceptional communication, collaboration, and leadership skills, with a proven track record of leading and mentoring technical teams.
Strong organizational and problem-solving abilities, with an aptitude for driving process improvements and operational excellence.

Additional Information:

Visa is an EEO Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, or protected veteran status. Visa will also consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law.

Employment Type:

Full-time