Operate and optimize Oracle Database and Exadata environments to meet stringent availability, performance, and scalability targets in 24x7 production.
Lead database reliability engineering initiatives, including HA design patterns, capacity planning, demand forecasting, and performance analysis/system tuning.
Deliver advanced performance tuning (SQL optimization, indexing strategies, configuration and storage tuning) and drive measurable improvements in latency, throughput, and stability.
Design and maintain backup, recovery, and disaster recovery strategies; validate restore procedures and ensure readiness for mission-critical environments.
Apply SRE best practices, including defining SLIs/SLOs, managing error budgets, and improving incident response through post-incident reviews and durable corrective actions.
Build automation and tools (Python/Shell/PowerShell) to eliminate toil, reduce MTTR, improve deployment reliability, and prevent recurring incidents.
Instrument and enhance observability using monitoring/APM stacks (e.g., Prometheus, Grafana, APM) to improve signal quality and reduce alert noise.
Partner with engineering and architecture teams on service and database design, data modeling decisions, and system architecture improvements for distributed systems.
Education: Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
Experience: 6 years in SRE, Cloud Engineering, DevOps, Database Reliability, or similar production-operations engineering roles.
Oracle Database expertise: Expert hands-on experience with Oracle Database and Exadata administration, high availability architectures, and production operations.
Performance tuning: Demonstrated capability in SQL tuning, indexing strategies, resource utilization analysis, and system tuning for high-scale workloads.
Backup/DR: Proven experience designing and operating backup, recovery, and disaster recovery solutions for 24x7 mission-critical systems.
Automation/scripting: Strong hands-on proficiency in Python and/or Shell/PowerShell for automation, tooling, and operational workflows.
Reliability & distributed systems: Solid understanding of cloud concepts, distributed systems behaviors, and SRE fundamentals (SLIs/SLOs, incident response, RCA).
Operational excellence: Strong troubleshooting, analytical thinking, and clear communication skills; comfortable acting as an escalation point during critical incidents.
Good-to-Have
Cloud platforms: OCI preferred; AWS/Azure/GCP experience also valuable.
IaC & configuration management: Terraform, Ansible, and Infrastructure-as-Code best practices.
Containers: Kubernetes and Docker exposure in production environments.
Observability depth: Experience with database observability, APM tooling, tracing, and alert quality/noise reduction initiatives.
AI familiarity: Exposure to LLMs, RAG, or AI agents (especially in operational tooling/automation contexts).
Certifications: Oracle Database/Exadata, OCI (or other cloud architect), or SRE/DevOps-related certifications.
Self-Assessment Questions
Have I owned production Oracle Database/Exadata environments and successfully improved availability or performance through concrete tuning or architecture changes?
Can I confidently diagnose performance issues end-to-end (SQL, indexing, configuration, storage, and workload characteristics) and explain tradeoffs to stakeholders?
Have I designed and validated backup/restore and DR processes (including regular testing) for systems that require 24x7 reliability?
Do I routinely build automation in Python/Shell/PowerShell to reduce manual operational work, improve MTTR, or prevent recurring incidents?
Am I comfortable applying SRE practices (SLIs/SLOs, error budgets, incident response, RCA/postmortems) and driving improvements across teams?
Career Level: IC3
As a world leader in cloud solutions, Oracle uses tomorrow’s technology to tackle today’s challenges. We’ve partnered with industry leaders in almost every sector and continue to thrive after 40+ years of change by operating with integrity. We know that true innovation starts when eve ...