Lead Support Analyst, Shared Services and Production Management, Information Technology (MNC Institutional Brokerage & Investment Services)

Riverforest Connections Private Limited

Job Location:

Pune - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

Job Summary

Key Areas of Responsibilities

Own and support monitoring and SRE operations, ensuring system reliability, availability, and performance.
Build, enhance, and maintain monitoring solutions using ITRS Geneos, Prometheus, VictoriaMetrics, Elasticsearch, and Grafana.
Develop, optimize, and maintain alerting rules, dashboards, and observability pipelines.
Troubleshoot and resolve complex issues during major incidents, providing clear and timely communication.
Troubleshoot Linux servers (RHEL 7/8/9), including upgrades, configurations, patching, and maintenance, while determining appropriate monitoring requirements for system changes.
Analyze logs, investigate issues, and perform fault finding to identify performance exceptions.
Collaborate with engineering, application, and infrastructure teams to improve system resilience, stability, security, efficiency, and scalability.
Contribute to automation strategies, deployment processes, and continuous operational improvements.
Participate in on-call rotations, including off-hours and scheduled weekend support.
Participate in Disaster Recovery (DR) and Business Continuity Planning (BCP) drills.
Continuously research and adopt modern monitoring and SRE tools and practices.

Requirements

Strong experience with monitoring and observability platforms, including ITRS Geneos, Prometheus, VictoriaMetrics, Elasticsearch, Grafana, and Kibana.
Hands-on experience building and implementing Prometheus pipelines, including exporters, scraping configurations, relabelling, metric routing, and integrations with long-term storage (e.g. VictoriaMetrics).
Experience building and maintaining Logstash pipelines, including ingestion, parsing, filtering, enrichment, and routing of logs into Elasticsearch.
Ability to design, build, and maintain Grafana and Kibana dashboards for metrics, logs, and performance analytics across distributed systems.
Solid understanding of metrics, logging, alerting, dashboards, and observability pipelines.
Strong Linux administration skills (RHEL 7/8/9), including troubleshooting, upgrades, configuration, patching, and performance optimization.
Good understanding of SRE principles, high availability, scalability, incident management, and Disaster Recovery (DR) / Business Continuity Planning (BCP) activities.
Experience with automation (e.g. Bash, Python, Ansible, CI/CD tools) is an advantage.
Understanding of networking fundamentals, performance tuning, and troubleshooting of distributed systems.
Prior experience in Production Support, SRE, Monitoring Engineering, or Shared Services Operations, with participation in on-call rotations including after-hours and weekend support.
Strong analytical, problem-solving, and communication skills, with the ability to work collaboratively under pressure.
Self-motivated, adaptable, and able to prioritize, learn continuously, and manage multiple responsibilities effectively.

Candidate profile:

Bachelor's degree in computer science / engineering.
Minimum 8 years' experience within IT / an investment bank.
Excellent, fluent English.

Required Skills:

SRE Operations, ITRS Geneos, Prometheus, VictoriaMetrics
