We are seeking an accomplished Site Reliability Engineer (SRE) Sr. Consultant to join our dynamic Observability team. In this senior role, you will provide technical leadership in developing and maintaining reliable, secure, and cost-effective observability solutions that support our global operations.
As the Sr. Consultant SRE, you will serve as the strategic bridge between development and operations, ensuring all systems and services are efficient, highly available, resilient, and scalable. You will collaborate closely with software engineers, system administrators, and cross-functional stakeholders to drive automation, optimize performance, and enable seamless application delivery.
You will take end-to-end ownership of critical observability initiatives, with a strong focus on availability, performance, security, and reliability. You will lead the design and implementation of robust monitoring, alerting, and automation frameworks to minimize incidents and accelerate incident resolution. Your leadership will be instrumental in guiding and mentoring the team, ensuring best practices are consistently adopted and operational excellence is maintained.
Key responsibilities include driving continuous improvement across processes, tools, and technologies; leading root cause analysis; and developing preventive measures for production incidents. You will champion a culture of collaboration, innovation, and proactive problem-solving, supporting engineering teams with the technical expertise needed to meet demanding requirements.
As an integral member and leader within our Agile Scrum teams, your technical acumen, leadership skills, and ability to mentor others will be central to delivering impactful, high-quality results.
Responsibilities
Lead SRE and DevOps operations during APAC hours, ensuring alignment with project objectives, delivery timelines, SLAs, and OLAs.
Act as the primary escalation point for complex technical issues and incidents, driving resolution and communicating status to leadership and stakeholders.
Provide strategic input and recommendations on SRE and DevOps initiatives to management, supporting roadmap planning and resource allocation.
Coordinate and manage relationships with multiple stakeholders, both internal and external, across various technology domains.
Analyze production defects, perform in-depth root cause analysis across code, data, and infrastructure, and champion the implementation of long-term preventative solutions.
Mentor, guide, and inspire team members through technical leadership, code reviews, pairing, and ongoing knowledge sharing.
Lead security and compliance efforts by ensuring timely application of security patches and hotfixes, and adherence to cybersecurity best practices.
Oversee the design, deployment, and continuous improvement of monitoring, alerting, and logging instrumentation, ensuring comprehensive observability.
Architect and drive the development of automation frameworks to optimize operational efficiency, eliminate manual toil, and streamline system integration.
Manage and support observability platforms, including Splunk, ClickHouse, Grafana, Prometheus, M3DB, OpenTelemetry, Fluent Bit, Elasticsearch, OpenSearch, and CloudWatch.
Collaborate with development and product teams to design and implement scalable monitoring solutions and support the creation of reliable environments across the SDLC.
Promote and enforce DevOps and SRE best practices, fostering a culture of automation, reliability, and continuous improvement across the organization.
Design, implement, and maintain robust CI/CD pipelines, enabling rapid, reliable, and automated software delivery.
Administer, optimize, and scale cloud infrastructure (AWS, GCP) to ensure high availability, performance, and security.
Lead the adoption and management of infrastructure-as-code practices using tools such as Terraform, Ansible, or CloudFormation.
Continuously monitor and analyze system health, proactively identifying and mitigating risks to reliability and performance.
Oversee deployment and management of containerization and orchestration solutions (Docker, Kubernetes) for modern application delivery.
Drive incident management processes, including leading post-incident reviews, facilitating blameless postmortems, and implementing actionable improvements.
Create, maintain, and improve detailed documentation for infrastructure, processes, runbooks, and standard operating procedures.
Provide advanced technical support and troubleshooting, guiding team members through complex infrastructure and deployment issues.
Identify, propose, and implement opportunities for process, tooling, and workflow automation to drive operational excellence.
Lead disaster recovery planning, capacity management, and business continuity initiatives in collaboration with cross-functional teams.
Evaluate, recommend, and drive the adoption of new technologies, tools, and practices that enhance reliability, scalability, and observability.
Present technical strategies, incident findings, and project updates to executive leadership and cross-functional stakeholders.
Foster an inclusive and collaborative team environment, supporting professional growth and the continuous development of SRE best practices.
Visa's Observability ecosystem includes over 2,000 platform nodes utilizing approximately 15 different tools for logging, monitoring, and tracing, alongside 80,000 client agents. The system handles daily log ingestion exceeding 100 TB and oversees hundreds of critical applications, supporting vital alerts, dashboards, and reports. To maintain this high level of performance and reliability, we need a Site Reliability Engineer Sr. Consultant with comprehensive knowledge and practical experience. This position requires an I6.5-level engineer who can operate independently with minimal supervision.
About Visa's PRE Observability Team
Visa's Product Reliability Engineering (PRE) Observability team partners with Product Development as well as Operations & Infrastructure teams to build and manage innovative, reliable, scalable, secure, and cost-effective observability platform solutions. We are looking for talented Senior Site Reliability Engineers to join our driven team with a focus on maximizing system availability, performance, security, and reliability. This dynamic role requires technical leadership, strong problem-solving skills, and expertise in coding, testing, and debugging.
This is a hybrid position. Expectation of days in office will be confirmed by your hiring manager.
Qualifications:
Basic Qualifications:
Preferred Qualifications:
Additional Information:
Visa is an EEO Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, or protected veteran status. Visa will also consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law.
Remote Work:
No
Employment Type:
Full-time