Senior Operations Reliability Engineer – Cloud Infrastructure (AWS & Windows)

Chennai - India

Monthly Salary: Not Disclosed

Posted on: 3 days ago

Vacancies: 1 Vacancy

Job Summary

Genesys empowers organizations of all sizes to improve loyalty and business outcomes by creating the best experiences for their customers and employees. Through Genesys Cloud, the AI-powered Experience Orchestration platform, organizations can accelerate growth by delivering empathetic, personalized experiences at scale to drive customer loyalty, workforce engagement, efficiency, and operational improvements.

We employ more than 6,000 people across the globe who embrace empathy and cultivate collaboration to succeed. And while we offer great benefits and perks like larger tech companies, our employees have the independence to make a larger impact on the company and take ownership of their work. Join the team and create the future of customer experience together.

Overview

As a Senior Operations Reliability Engineer with a specialization in Cloud Infrastructure, you will play a key role in maintaining and improving the reliability, stability, and operational maturity of enterprise cloud and compute environments. This role focuses primarily on AWS infrastructure, with supporting responsibility for Azure and Windows-based systems.

You will lead incident detection, advanced troubleshooting, patching and vulnerability remediation validation, and proactive reliability improvements across AWS services and Windows/Linux compute. In addition to hands-on operational support, you will actively contribute to automation initiatives, AIOps tuning, and the continuous improvement of monitoring, correlation, and signal quality.

This role blends advanced cloud operations with reliability engineering practices, including event correlation, automation validation, telemetry refinement, and support for emerging self-healing capabilities. You will collaborate across Cloud Engineering, Security, IAM, Network, and ServiceNow teams to strengthen operational standards and accelerate automation maturity across the platform.

Responsibilities

General Reliability Operations

Resolve complex cloud and OS-related incidents through advanced troubleshooting, coordinating cross-functional teams when necessary.

Monitor observability, AIOps, and event management platforms to detect anomalies, performance degradation, and emerging risks across AWS and compute systems.

Perform advanced incident triage and event correlation to determine root cause and reduce misrouted or duplicate incidents.

Lead validation of automated remediation workflows and ensure reliability of automation before production adoption.

Identify recurring manual operational tasks and translate them into automation requirements or lightweight scripted solutions.

Contribute structured operational insights, telemetry improvements, and signal refinement recommendations to reduce alert noise.

Lead post-incident reviews, including root cause documentation and reliability improvement actions.

Ensure cloud and OS telemetry aligns with monitoring governance and CMDB standards to support accurate correlation and impact analysis.

Partner with Cloud Engineering, Security, IAM, and Network teams to mature reliability practices and reduce operational risk.

Cloud & Windows Infrastructure Responsibilities

Troubleshoot advanced AWS operational issues, including EC2 performance anomalies, networking misconfigurations, IAM policy conflicts, storage degradation, and service dependency failures.

Support Azure VM and cloud service troubleshooting where applicable, ensuring cross-cloud awareness.

Perform deep OS-level diagnostics and remediation, primarily within Windows Server environments, with supporting responsibilities across Linux systems.

Analyze telemetry from AWS CloudWatch, system logs, and vulnerability management platforms to detect trends and systemic weaknesses.

Own validation and oversight of patching and vulnerability remediation workflows, primarily for Windows systems with a supporting role for Linux systems, ensuring compliance and reducing drift.

Improve tagging compliance, IAM access hygiene, backup validation, and governance posture through operational enforcement and automation.

Validate and support resilience testing (backup restores, failover simulations, DR exercises).

Contribute to infrastructure-as-code (Terraform) enhancements.

Develop scripts (PowerShell, Python, CLI-based automation) to improve repeatability and reduce manual effort.

Participate in readiness planning for new AWS services, infrastructure changes, or architectural updates, ensuring monitoring and operational support models are in place.

Provide mentorship and technical guidance to junior reliability engineers.

Automation & AIOps Contributions

Actively tune alert thresholds, suppression logic, and event correlation rules within AIOps and monitoring platforms.

Partner with teams to refine automated remediation logic and validate reliability before rollout.

Improve cloud signal quality by ensuring accurate metrics, logs, and dependency mapping across AWS services.

Contribute operational feedback to enhance predictive alerting and early detection models.

Track and support improvements in MTTR, alert noise reduction, patch compliance, and automation coverage.

Requirements

Bachelor's degree in IT or a related field, or equivalent experience.

5 years of experience in cloud infrastructure, systems engineering, or infrastructure operations roles.

Strong hands-on experience with AWS services (EC2, VPC, IAM, EBS, S3, CloudWatch, networking).

Familiarity with Azure cloud environments.

Solid experience administering Windows Server; working knowledge of Linux systems.

Experience with patch management, vulnerability remediation, and system hardening.

Strong understanding of cloud governance principles (tagging, IAM access control, backups, cost awareness, compliance).

Experience working with monitoring, observability, and event management platforms.

Ability to write and modify automation scripts (PowerShell, Python, CLI tools, YAML/JSON).

Strong troubleshooting and analytical skills, with the ability to interpret complex telemetry and log data.

Experience contributing to automation initiatives or reliability improvements.

Effective communication skills for cross-functional collaboration.

Motivation to continue developing deeper skills in automation, AIOps, infrastructure-as-code, and cloud reliability engineering.

Additional Information

Working Hours: 9:00 AM to 6:00 PM IST (first shift), supporting global platform operations.

On-Call Support: Participation in a shared rotational on-call schedule is required.

If a Genesys employee referred you, please use the link they sent you to apply.

About Genesys:

Genesys empowers more than 8,000 organizations worldwide to create the best customer and employee experiences. With agentic AI at its core, Genesys Cloud is the AI-Powered Experience Orchestration platform that connects people, systems, data, and AI across the enterprise. As a result, organizations can drive customer loyalty, growth, and retention while increasing operational efficiency and teamwork across human and AI workforces. To learn more visit.

Reasonable Accommodations:

If you require a reasonable accommodation to complete any part of the application process, or are limited in your ability to access or use this online application and need an alternative method for applying, you or someone you know may contact us at .

You can expect a response within 24 to 48 hours. To help us provide the best support, click the email link above to open a pre-filled message and complete the requested information before sending. If you have any questions, please include them in your email.

This email is intended to support job seekers requesting accommodations. Messages unrelated to accommodation, such as application follow-ups or resume submissions, may not receive a response.

Genesys is an equal opportunity employer committed to fairness in the workplace. We evaluate qualified applicants without regard to race, color, age, religion, sex, sexual orientation, gender identity or expression, marital status, domestic partner status, national origin, genetics, disability, military and veteran status, and other protected characteristics.

Please note that recruiters will never ask for sensitive personal or financial information during the application phase.

Required Experience:

Senior IC

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

Genesys

Every year, Genesys® delivers more than 70 billion remarkable customer experiences for organizations in over 100 countries. Through the power of the cloud and AI, our technology connects every customer moment across marketing, sales and service on any channel, while also improving emp ...
