Key Responsibilities:
- 12 years of experience.
- Design and develop enterprise-grade APIs and configuration solutions.
- Contribute to enterprise and application architecture design.
- Lead observability initiatives, including monitoring, alerting, and incident response.
- Build and maintain dashboards and alerting systems using tools such as Grafana, Prometheus, and Splunk.
- Create and maintain detailed runbooks for operational procedures and incident handling.
- Define and monitor SLAs, SLOs, and KPIs for critical services.
- Collaborate with architecture, development, and security teams to ensure system reliability.
- Evaluate and adopt new technologies to improve system performance and maintainability.
Required Skills:
- Strong background in IT infrastructure, cloud platforms (AWS, Azure, GCP), and SRE practices.
- Experience in enterprise and application architecture.
- Proven experience in building APIs and backend services.
- Hands-on experience with tools:
  - Monitoring & Observability: Grafana, Prometheus, Splunk
  - ITSM & Operations: ServiceNow, OpsRamp
  - Project & Incident Tracking: JIRA
- Experience in building alerts, dashboards, and operational runbooks.
- Experience managing distributed systems and large-scale production environments.
- Strong leadership, communication, and problem-solving skills.
- Ability to quickly learn and adapt to new technologies and environments.
Preferred:
- Exposure to OpenShift and Azure cloud platforms.
- Certifications: SRE Foundation, ITIL, or relevant cloud certifications.