Own and scale mission-critical ERP/SaaS services while building intelligent cloud-native capabilities. This role requires an SRE mindset combined with AI/ML expertise and strong application engineering skills across public and private cloud environments.
Key Responsibilities
- End-to-end service ownership: design for telemetry, security, resiliency, scalability, and performance; lead sizing and architecture; drive service health reviews and process simplification.
- Incident management and prevention: lead postmortems and RCAs, coordinate fixes, define repair items, and implement data-driven prevention and continuous improvement.
- AI/ML and GenAI delivery: design and integrate solutions using LLMs, RAG, agentic workflows, and conversational AI; build low-latency model serving and retraining pipelines.
- Application engineering: develop performant microservices for distributed, containerized, cloud-native systems.
- Automation: eliminate toil by automating operational workflows, recovery procedures, code delivery, and configuration management; build internal tools and reusable scripts and services to accelerate delivery and reduce errors.
- Observability: define and implement monitoring, logging, alerting, and tracing strategies; establish SLOs, SLIs, and error budgets; improve diagnostics and performance visibility for rapid triage.
- Cross-functional collaboration: partner with product, operations, and data teams to translate requirements into secure, scalable solutions; communicate effectively with technical and non-technical stakeholders.
Minimum Qualifications
- BS/MS in Computer Science or a related field; 10 years of software engineering experience in cloud environments.
- Strong background in distributed systems and microservices using Java and Python; SQL and data modeling; Python for AI and automation.
- SRE/DevOps expertise: systems and networking fundamentals, application security, observability, performance analysis, and incident response.
- Proven SDLC excellence: code quality, code reviews, version control, CI/CD, testing, and release engineering.
- Excellent written and verbal communication; English fluency.
Preferred/Technical Skills
- AI/ML/GenAI: experience with foundation models, RAG, and agentic architectures; model deployment, optimization, monitoring, and retraining.
- Cloud and containers: experience with containerization, orchestration, and resilient, fault-tolerant microservices.
- Observability: hands-on experience designing dashboards, alerts, traces, logs, and metrics; defining SLOs, SLIs, and error budgets; ensuring on-call readiness and runbook quality.
- Operations: performance tuning across Java, Python, and SQL for large-scale enterprise applications; strong Linux/Unix expertise; capacity planning and reliability reviews.
- Automation and scripting: proficiency in scripting to automate operational workflows, build tooling, and CI/CD tasks (e.g., shell scripting, Python, configuration-as-code, task runners).
- Familiarity with enterprise ERP applications and standard DevOps tooling and practices.
Career Level - IC4
Required Experience:
Staff IC
As a world leader in cloud solutions, Oracle uses tomorrow’s technology to tackle today’s challenges. We’ve partnered with industry leaders in almost every sector, and continue to thrive after 40+ years of change by operating with integrity. We know that true innovation starts when eve ...