Principal SaaS Capacity Engineer

Zapopan - Mexico

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Description

Required Qualifications

Bachelor's or Master's degree in Computer Science, Electrical Engineering, Cloud/Systems Engineering, or a related field.
5 years of experience in cloud infrastructure, SaaS operations, or capacity engineering roles.
Hands-on experience with large-scale distributed systems, OCI (or AWS, Azure, GCP), and SaaS production environments.
Strong programming and scripting experience (Python, Go, Shell, SQL) for automation and AI/ML model deployment.
Proven experience deploying AI/ML solutions for capacity forecasting, anomaly detection, and intelligent workload tuning.
Deep understanding of cloud capacity topology and distributed service dependencies.
Proficiency with infrastructure-as-code (Terraform, Ansible, Helm, Kubernetes).
Familiarity with AIOps tools and AI-driven observability platforms (Datadog, Dynatrace, Splunk, or similar).
Knowledge of resiliency and disaster recovery strategies, including AI-simulated failure modeling.

Preferred Qualifications

Advanced degree (Master's/PhD) with specialization in AI, ML, Data Science, or distributed systems engineering.
Experience building and deploying self-healing, AI-driven automation at scale in a SaaS environment.
Domain expertise in reinforcement learning applications for automated resource optimization.
Direct exposure to Oracle Cloud Infrastructure (OCI) systems and tools.
Experience with cloud-native AI/ML services, MLOps, and continuous model monitoring.

Competencies and Skills

Expertise in designing, developing, and deploying AI/ML models for cloud infrastructure use cases (demand forecasting, anomaly detection, workload optimization).
Advanced proficiency in automation, orchestration, and self-healing system architectures.
Skilled in communicating technical concepts, AI-powered analytics, and strategic insights to engineering and executive audiences.
Strong analytical and critical thinking skills with a deep data-driven mindset.
Curiosity and initiative to explore APIs, system profiles, and operational anomalies, translating technical findings into impactful business outcomes.
Highly collaborative, adaptive, and passionate about operational excellence and continuous learning.
Ability to influence cross-team priorities and drive best practices in AI-enhanced capacity engineering.

Responsibilities

Key Responsibilities

Service Accountability: Ensure SaaS production capacity availability, optimization, scaling, automation, reserve management, and quota governance.
AI/ML Integration: Apply AI/ML models for predictive capacity forecasting, anomaly detection, and workload auto-tuning to anticipate demand spikes and prevent outages.
Observability & AIOps: Leverage AI-powered observability and AIOps platforms for end-to-end system monitoring, intelligent alerting, and automated incident mitigation.
Strategic Partnership: Collaborate with Product and Development teams to design, validate, and align AI-driven scaling and capacity planning strategies with new launches and initiatives.
Automation & Orchestration: Design, implement, and optimize automation and orchestration pipelines, including self-healing systems, policy-driven provisioning, and disaster recovery simulations, using AI to enhance reliability and operational resilience.
Data-Driven Decision Support: Deliver advanced instrumentation, AI-powered analytics, and actionable dashboards to inform executives, engineering teams, and stakeholders.
Technical Leadership: Translate complex OCI stack and cloud platform resources (compute, storage, DB, networking) into business-ready, AI-enhanced capacity solutions and performance profiles.
Simulation & Resiliency: Use AI/ML models to simulate, validate, and improve resiliency and disaster recovery scenarios for faster, more robust recovery response.
Collaboration & Communication: Present AI-driven insights, risks, and recommendations to engineering teams, ICs, and executives to illuminate capacity trends and data-driven priorities.
Continuous Innovation: Assess new AI/ML techniques, AIOps platforms, and automation tools for ongoing improvements in infrastructure reliability, scalability, and cost optimization.

Qualifications

Career Level - IC4

Required Experience:

Staff IC

About Company

Oracle

Oracle provides the world's most complete, open, and integrated business software and hardware systems, with more than 370,000 customers, including 100 of the Fortune 100, representing a variety of sizes and industries in more than 145 countries around the globe. And Oracle's 110,000 gl ...
