The Oracle Cloud Infrastructure (OCI) Compute organization offers GPU Superclusters, bare metal CPUs, and virtual machines at scale to our customers. With rapid growth in machine learning, the demand for GPUs is exploding, making the performance and efficiency of cloud-scale services a critical area of investment.
The Core Architecture team partners with teams across the entire Compute organization to identify performance and efficiency constraints within the lifecycle of compute services, from forecasting, inventory management, capacity ingestion, and placement to repair and decommissioning. Consulting engineers are responsible for performing deep analysis of critical business problems, identifying bottlenecks, and proposing and incubating new architectural constructs that address the needs of some of our largest customers. These solutions could take the shape of new microservices or a restructuring of the control plane services and dataflow.
You will take the lead in defining the architecture for the brand-new host state management engine that will power the next generation of the Compute Control Plane. This initiative spans multiple Compute domains, from GPU validation to repairs, and you will drive engineers from these organizations to build microservice-based solutions that will enable Compute to scale for growing customer demands.
We are looking for a hands-on senior principal engineer with technical breadth, proven experience in solving cloud-scale problems, and distributed systems design and implementation experience to build fault-tolerant solutions that will form the foundations of the next generation of Compute offerings. The candidate is expected to have strong written and verbal communication skills, the ability to lead projects across organizational boundaries, and experience representing their work to senior leadership.
Career Level - IC5
As a Consulting Member of Technical Staff, you will lead the definition and evolution of cloud-scale services using a distributed, microservices-based architecture. You will define software development best practices within your organization to develop and deploy high-quality software at a rapid pace. You will identify business KPIs for your software and iteratively build impactful solutions that solve hard customer problems. You will be responsible for hands-on software design, development, and debugging in a cloud-native environment.
Qualifications:
BS or MS degree in Computer Science/Engineering or a related IT field, or equivalent experience relevant to the functional area
10 years of development experience with large-scale, highly available distributed systems
Proficiency with Cloud-based Data Store primitives
Proficiency in Java programming patterns
Experience with operating distributed services at scale
Expertise in Linux and operating systems
Systematic problem-solving approach, strong communication skills, and strong ownership and drive
Deep understanding of service metrics and alarms through the development of dashboards, service KPIs, and alarming systems
Propose, scope, design, and direct automation optimizations and enhancements
Mentor junior engineers
Preferred Qualifications:
Experience in management and automation of end-to-end CPU/GPU lifecycles at scale
Experience in building large-scale control planes or distributed workflows
Proficiency with Cloud and CI/CD environments
Proficiency with modern build tools and pipelines
Proficiency building multi-tenant virtualized infrastructure
Proficiency with change control management and mature operating processes
Proficiency with security, including identity, SSL, and certificates
Proficiency with databases and data stores
Full-Time