Description
Required Qualifications
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, Cloud/Systems Engineering, or a related field.
- 5 years of experience in cloud infrastructure, SaaS operations, or capacity engineering roles.
- Hands-on experience with large-scale distributed systems, OCI (or AWS, Azure, GCP), and SaaS production environments.
- Strong programming and scripting experience (Python, Go, Shell, SQL) for automation and AI/ML model deployment.
- Proven experience deploying AI/ML solutions for capacity forecasting, anomaly detection, and intelligent workload tuning.
- Deep understanding of cloud capacity topology and distributed service dependencies.
- Proficiency with infrastructure-as-code (Terraform, Ansible, Helm, Kubernetes).
- Familiarity with AIOps tools and AI-driven observability platforms (Datadog, Dynatrace, Splunk, or similar).
- Knowledge of resiliency and disaster recovery strategies, including AI-simulated failure modeling.
Preferred Qualifications
- Advanced degree (Master's/PhD) with specialization in AI, ML, Data Science, or distributed systems engineering.
- Experience building and deploying self-healing, AI-driven automation at scale in a SaaS environment.
- Domain expertise in reinforcement learning applications for automated resource optimization.
- Direct exposure to Oracle Cloud Infrastructure (OCI) systems and tools.
- Experience with cloud-native AI/ML services, MLOps, and continuous model monitoring.
Competencies and Skills
- Expertise in designing, developing, and deploying AI/ML models for cloud infrastructure use cases (demand forecasting, anomaly detection, workload optimization).
- Advanced proficiency in automation, orchestration, and self-healing system architectures.
- Skilled in communicating technical concepts, AI-powered analytics, and strategic insights to engineering and executive audiences.
- Strong analytical and critical thinking skills with a deep data-driven mindset.
- Curiosity and initiative to explore APIs, system profiles, and operational anomalies, translating technical findings into impactful business outcomes.
- Highly collaborative, adaptive, and passionate about operational excellence and continuous learning.
- Ability to influence cross-team priorities and drive best practices in AI-enhanced capacity engineering.
Responsibilities
Key Responsibilities
- Service Accountability: Ensure SaaS production capacity availability, optimization, scaling, automation, reserve management, and quota governance.
- AI/ML Integration: Apply AI/ML models for predictive capacity forecasting, anomaly detection, and workload auto-tuning to anticipate demand spikes and prevent outages.
- Observability & AIOps: Leverage AI-powered observability and AIOps platforms for end-to-end system monitoring, intelligent alerting, and automated incident mitigation.
- Strategic Partnership: Collaborate with Product and Development teams to design, validate, and align AI-driven scaling and capacity planning strategies with new launches and initiatives.
- Automation & Orchestration: Design, implement, and optimize automation and orchestration pipelines, including self-healing systems, policy-driven provisioning, and disaster recovery simulations, using AI to enhance reliability and operational resilience.
- Data-Driven Decision Support: Deliver advanced instrumentation, AI-powered analytics, and actionable dashboards to inform executives, engineering teams, and stakeholders.
- Technical Leadership: Translate complex OCI stack and cloud platform resources (compute, storage, DB, networking) into business-ready, AI-enhanced capacity solutions and performance profiles.
- Simulation & Resiliency: Use AI/ML models to simulate, validate, and improve resiliency and disaster recovery scenarios for faster, more robust recovery response.
- Collaboration & Communication: Present AI-driven insights, risks, and recommendations to engineering teams, ICs, and executives to illuminate capacity trends and data-driven priorities.
- Continuous Innovation: Assess new AI/ML techniques, AIOps platforms, and automation tools for ongoing improvements in infrastructure reliability, scalability, and cost optimization.
Qualifications
Career Level - IC4
Required Experience:
Staff IC