Staff Site Reliability Engineer

Bengaluru - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

This job posting is outdated and the position may be filled

Job Summary

Job Description Summary

The Site Reliability Engineer will be responsible for the performance and availability of the Compute and Network infrastructure consumed by all business segments. The Site Reliability teams are composed of highly talented individuals obsessively focused on availability through operational excellence. The ideal individual is relentlessly technical, passionate about automating everything, and totally committed to delivering amazing customer experiences.

GE HealthCare is a leading global medical technology and digital solutions innovator. Our purpose is to create a world where healthcare has no limits. Unlock your ambition, turn ideas into world-changing realities, and join an organization where every voice makes a difference and every difference builds a healthier world.

Job Description

Roles & Responsibilities:

In this role you will:

Own, manage, and adapt effective monitoring and alerting systems for GEHC.
Develop and manage a single pane of glass that provides a single view of GEHC ecosystem monitoring, including top critical business applications, sites, and critical network devices.
Own, develop, and manage a world-class monitoring data platform that ingests all monitoring telemetry data across applications and infrastructure within GEHC and integrates with the AIOps platform.
Develop and product-manage automated solutions / SaaS products to maintain and optimize the availability and performance of critical business processes and services, and to address potential problems in the infrastructure and application ecosystem before they result in a service interruption.
Ensure top critical business applications and their ecosystems are effectively monitored, with appropriate alerting mechanisms integrated with event management systems for an effective single pane of glass.
Deliver self-service tools that rely on the monitoring platform / SRE, for example logs and statistics visualization, monitoring dashboards, etc.
Collaborate closely with product teams, both internal GE product teams and monitoring/AIOps tool vendors, to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability. Contribute to SLI, SLO, and SLA definition, monitoring, alerting, and reporting efforts.

Partner with and support other operations teams in investigating the root cause of major P1 and escalated P2 incidents through a monitoring lens.
Establish performance baselines and capacity thresholds, correlate events, and define monitoring/alerting criteria.
Continuously identify patterns that point to a larger problem and solve it to avoid repeat issues.

Stay abreast of the latest trends in application and infrastructure monitoring, provisioning, maintenance, and uptime. Learn, prototype, and apply the newest tools and best practices in real life to meet the goals of the SRE practice.

Education Qualification

Bachelor's Degree in Computer Science or STEM Majors (Science, Technology, Engineering, and Math) with advanced experience.
7 years of relevant experience in the IT Operations/Site Reliability Engineering domain, with demonstrable expertise in architecting, designing, and implementing solutions for availability and/or performance.
Comprehensive understanding of application performance monitoring and cloud technologies, and the ability to design and implement Dynatrace solutions in complex enterprise environments.
Solid expertise in designing and implementing Dynatrace / Dynatrace extensions or managing an APM / observability solution.
Proficient in Dynatrace features and architecture design, with installation, fine-tuning, and implementation experience across various environments (Production, Test, Development, and Disaster Recovery).
Expertise in Dynatrace platform configuration, including host grouping, auto tagging, naming rules, management zones, RUM (Real User Monitoring), Synthetics, session properties, request attributes, user tags, log monitoring, alert profiles, problem notifications, threshold tuning, and setting up integrations with other monitoring tools and ServiceNow.
Experience in implementing and configuring Dynatrace tools, setting up synthetic and transaction monitoring, and ensuring comprehensive infrastructure and application monitoring.
Create custom extensions in Dynatrace using shell, Python, and batch scripts based on REST APIs and logs.
Set up Dynatrace extension configurations, dashboards (including business), infrastructure analytics, observability, logs, and metrics data collection, and interpret the same.
Proficiency in Dynatrace Query Language (DQL) and in creating custom dashboards as required.
Establish and foster visible architectural principles and practices to build reusable designs and systems that promote reliability, velocity, scale, security, and efficiency.
Understand and improve applications and plan for faster MTTD and MTTR and for auto-healing.
Understand reliability metrics and enhance automation solutions for auto-healing and incident resolution.
Experience with full-stack troubleshooting across network, application, hardware, management fabric, or distributed services layers.
Exposure to and familiarity with Agile and SRE principles, automated deployments, and build pipelines.

Desired Characteristics:

Excellence in written and verbal communication and presentation, and the ability to partner for success across all levels of the organization and technical depths.
Enterprise logging/alerting implementations using Splunk and the ELK stack; enterprise APM implementations using Dynatrace, AppDynamics, New Relic, etc.
Excellent knowledge of common operating systems (Unix/Linux, Windows).
Demonstrated experience scripting or developing software and services for the cloud (Ruby, Python, Go, etc.).
Extensive knowledge of network protocols (TCP/IP, SNMP, FTP, syslog, TFTP, etc.).
Experience managing version control systems such as Git
Experience deploying and managing infrastructure on public clouds such as AWS or Azure
Experience using an automated configuration management system (Terraform, Chef, Puppet, Ansible, Salt, etc.)
Strong organizational and project management skills
Strong analytical and problem resolution skills
Excellent knowledge of Network Management (SNMP, MIB)
Experience with configuring, customizing, and extending monitoring tools (Datadog, Sensu, Grafana, Splunk, etc.)
Excellent knowledge of TCP/IP networking and inter-networking technologies (routing/switching, proxy, firewall, load balancing, etc.)
Knowledge and experience using analytics software packages like Matlab, SAS, JMPro, etc. Programming experience with open source scripting and data analysis packages like Python and R is a plus.
Proactively engages with cross-functional teams to resolve issues and design solutions using critical thinking, analytical skills, and best practices, actively incorporating input from various sources
Strong analytical and problem-solving skills: effectively evaluates information/data to make decisions; anticipates obstacles and develops plans to resolve them
Continuous-improvement oriented: actively generates process improvements; champions and drives change initiatives
Ability to deliver results in a rapidly changing, dynamic environment
Emotional intelligence, ability to influence up and out, and the ability to work independently
Must be a team player with a strong desire to win
Passionate about continuously learning and able to quickly adapt and pivot to win in a dynamic environment
Highly organized and efficient; able to balance competing priorities and execute accordingly
Strong oral and written communication skills

Inclusion & Diversity

GE HealthCare is an Equal Opportunity Employer where inclusion matters. Employment decisions are made without regard to race, color, religion, national or ethnic origin, sex, sexual orientation, gender identity or expression, age, disability, protected veteran status, or other characteristics protected by law.
We expect all employees to live and breathe our behaviors: to act with humility and build trust; lead with transparency; deliver with focus; and drive ownership, always with unyielding integrity.

Our total rewards are designed to unlock your ambition by giving you the boost and flexibility you need to turn your ideas into world-changing realities. Our salary and benefits are everything you'd expect from an organization with global strength and scale, and you'll be surrounded by career opportunities in a culture that fosters care, collaboration, and support.

Disclaimer: GE HealthCare will never ask for payment to process documents, refer you to a third party to process applications or visas, or ask you to pay costs. Never send money to anyone suggesting they can provide employment with GE HealthCare.

Additional Information

Relocation Assistance Provided: Yes

Required Experience:

Staff IC

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

GE HealthCare

As a leading global medical technology and digital solutions innovator, GE Healthcare enables clinicians to make faster, more informed decisions through intelligent devices, data analytics, applications and services, supported by its Edison intelligence With over 100 years of healthc ...