Senior Site Reliability Engineer

Oracle

Job Location:

Pleasanton, CA - USA

Yearly Salary: $81,100 - $187,000

Posted on: 3 days ago

Vacancies: 1 Vacancy

Job Summary

Description

We are looking for a Site Reliability Engineer 3 to support mission-critical cloud services and production operations. The role focuses on improving service reliability, reducing operational risk, automating repetitive tasks, and driving faster detection and resolution of issues.

The engineer will work closely with development, infrastructure, security, and operations teams to monitor service health, troubleshoot production issues, participate in incident response, improve observability, and implement reliability best practices. This role also includes analyzing recurring failures, building automation, supporting deployments, and contributing to capacity planning, disaster recovery, and operational readiness.

Also works on a number of different region/realm rollouts and deployments. Forecasts demands and responds to capacity needs. Collaborates with software development teams to develop reliable and scalable infrastructures. Performs data collection to maintain and optimize operations and reliability. Leverages knowledge to perform incident response and/or maintenance tasks. Provides health and performance reporting. Identifies opportunities for automation. Communicates about services and identifies and explains the potential impact of changes. Provides support for technology and documents incidents. Experiments with new tools, assesses their potential impact, and develops knowledge of site reliability trends.

Responsibilities

Key Responsibilities
Capacity Ingestion and Management:
-Takes proactive steps to design and architect infrastructure and/or service according to terms for reliability and functionality.
-Forecasts demands for infrastructure and responds to capacity needs, ensuring systems have sufficient resources to handle current and future workloads.
-Collaborates with the software development team to develop infrastructures and features that are reliable and scalable according to deployment requirements.
-Independently identifies opportunities for and drives prototyping (e.g., testing new applications or infrastructures, assisting in onboarding).
Incident and Service Lifecycle Management:
-Performs data collection, triage, technical analysis, and redirection to maintain and optimize operations and infrastructure reliability.
-Independently monitors services, maintains up-to-date knowledge of their performance, and documents their condition.
-Leverages comprehensive knowledge to perform incident response, root cause analyses, and/or maintenance on assigned services (e.g., software installs, version upgrades, security updates, backup and recovery).
-Provides health and performance reporting and takes appropriate actions based on trends in data.
-May independently perform provisioning to support infrastructure, applications, and services.
-May perform standard and non-standard decommissioning (e.g., shutting down servers, removing data from databases) to remove objects that are no longer needed.
Automation:
-Identifies opportunities for automation and assesses potential benefits.
-Develops automation tools or scripts to provide solutions, gather metrics, monitor, analyze, mitigate, or remediate issues/defects within infrastructures.
-Independently conducts testing to ensure automation performs the task correctly and produces expected results.
Technical Communication and Guidance:
-Communicates the scale, capacity, security, performance attributes, and requirements of services and technology within and sometimes beyond immediate team.
-Identifies and explains the potential impact of infrastructure, feature, and tool changes, considering their impact on team operations.
Troubleshooting and Resolution:
-Provides operational support for technology, escalating incidents and other standard and non-standard issues arising within Oracle services.
-Participates in on-call shifts to address issues.
-Resolves technical issues spanning various services, investigating and debugging products in order to reach SLOs (service level objectives).
-Documents incidents and performs root cause analyses according to standard reporting methods.
-Independently performs post-mortem procedures to prevent incident recurrence.
Innovation and Improvement:
-Experiments with new tools and technologies to assess their potential impact on and improve infrastructure performance and reliability, ensuring adherence to security standards.
-Independently identifies and executes improvements for performance bottlenecks and deployments to ensure efficient resource usage, speed, and scalability.
-Develops knowledge of site reliability trends and shares new information with team members, management, and beyond to help others build, test, deploy, and run services.
-Performs standard and non-standard analyses and provides clear data on production to contribute to business development decisions (e.g., design changes).

Core Responsibilities
Planning & Execution:
Independently manages work, monitoring timelines and deliverables to ensure projects or initiatives stay on track and meet requirements. Proactively prioritizes work and adapts to resource or timeline shifts, suggesting adjustments to maintain project efficiency.
Collaboration & Partnership:
Collaborates across teams to align on expectations and achieve shared objectives. Builds and maintains a comprehensive understanding of business, stakeholder, and/or customer needs to build and support effective partnerships. Actively listens to diverse perspectives and asks questions to ensure understanding of others.
Problem Solving:
Independently identifies and addresses standard and non-standard issues in accordance with standard practices, escalating more complex issues as appropriate. Analyzes data and/or information from multiple sources to troubleshoot standard and non-standard errors. Contributes to knowledge sharing and best practices.
Continuous Learning:
Embraces continuous learning by actively seeking to build knowledge and new skills and/or tools and staying current with industry trends and best practices. Seeks out and leverages feedback and training to improve skills. Contributes to a culture of continuous learning and knowledge sharing with team members.
Continuous Improvement:
Develops ideas and recommends updates to increase the efficiency and effectiveness of processes, protocols, and workflows within a team. Seeks input from team members on alternative approaches and methods for improving work.

IaC: Terraform, Chef, Ansible

Languages: Python, Java, Bash

Orchestration: Kubernetes, Helm

CI/CD: Jenkins

Observability: Grafana, Prometheus

Qualifications
Disclaimer:

Certain U.S.-based or U.S. customer or client-facing roles may be required to comply with applicable requirements, such as immunization/occupational health mandates and/or drug testing requirements.

Range and benefit information provided in this posting are specific to the stated locations only.

US: Hiring Range in USD from: $81,100 to $187,000 per annum. May be eligible for bonus and equity.

Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions, and locations, as well as to reflect Oracle's differing products, industries, and lines of business.
Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.

Oracle US offers a comprehensive benefits package which includes the following:
1. Medical, dental, and vision insurance, including expert medical opinion
2. Short-term disability and long-term disability
3. Life insurance and AD&D
4. Supplemental life insurance (Employee/Spouse/Child)
5. Health care and dependent care Flexible Spending Accounts
6. Pre-tax commuter and parking benefits
7. 401(k) Savings and Investment Plan with company match
8. Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
9. 11 paid holidays
10. Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
11. Paid parental leave
12. Adoption assistance
13. Employee Stock Purchase Plan
14. Financial planning and group legal
15. Voluntary benefits including auto, homeowner, and pet insurance

The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.

Career Level - IC3

Required Experience:

Senior IC

About Company

Oracle

As a world leader in cloud solutions, Oracle uses tomorrow’s technology to tackle today’s challenges. We’ve partnered with industry leaders in almost every sector, and continue to thrive after 40+ years of change by operating with integrity. We know that true innovation starts when eve ...
