RITM0437736 - Production Support - Mobile Website

Randstad India


Job Location: Gurgaon - India

Monthly Salary: Not Disclosed
Posted on: Yesterday
Vacancies: 1 Vacancy

Job Summary

The role involves troubleshooting live incidents, performing root cause analysis, and coordinating with development and DevOps teams to ensure quick resolution and minimal downtime. The ideal candidate should have strong debugging skills, hands-on Kubernetes exposure, and working knowledge of modern web application stacks.

---

## Key Responsibilities

### 1. Incident Management & Support

* Monitor production systems and respond to alerts and incidents
* Perform **L2 troubleshooting** for application, infrastructure, and data issues
* Analyze logs, metrics, and traces to identify root causes
* Provide workarounds and permanent fixes in coordination with engineering teams
* Participate in on-call rotation and ensure SLA adherence
* Maintain incident reports and RCA documentation

---

### 2. Kubernetes & Infrastructure Operations

* Debug issues in Kubernetes environments:
  * Pod failures / CrashLoopBackOff
  * Memory & CPU throttling
  * ConfigMap / Secret issues
  * Networking & ingress problems
* Restart services, scale deployments, and validate rollouts
* Assist in release validation and smoke testing post-deployment

---

### 3. Application Troubleshooting

Working knowledge required across the stack:

**Frontend**

* UI issue analysis
* Browser console & network debugging

**Backend / Middleware**

* Kafka consumer lag / stuck messages
* Redis cache inconsistency / eviction issues
* API failures & timeout troubleshooting

**Database**

* PostgreSQL query failures & locks
* MongoDB connection & performance issues
* Data correction and validation (read/update scripts as needed)

---

### 4. Monitoring & Observability

* Monitor system health using **Datadog or equivalent observability tools**
* Create dashboards and alerts for proactive detection
* Analyze:
  * Logs
  * Metrics
  * Distributed traces
* Identify recurring issues and recommend preventive measures

---

### 5. Release & Change Support

* Support production deployments and hotfix releases
* Validate deployments and roll back if required
* Coordinate with Dev, QA, and DevOps teams
* Maintain runbooks and operational documentation

---

## Required Skills

* Hands-on experience with **Kubernetes production environments**
* Experience supporting distributed applications
* Knowledge of:
  * basics
  * Kafka messaging systems
  * Redis caching
  * PostgreSQL
  * MongoDB
* Experience with monitoring tools (Datadog, ELK, Prometheus, Grafana)
* Strong Linux troubleshooting skills
* Good understanding of APIs & microservices architecture

---

## Good to Have

* Basic scripting (Bash / Python)
* CI/CD pipeline understanding
* Cloud platforms (AWS/Azure/GCP)
* Performance tuning exposure

---

## Soft Skills

* Strong analytical & debugging mindset
* Calm under pressure during production incidents
* Good communication and stakeholder handling
* Documentation discipline

---

## Key Success Metrics

* MTTR (Mean Time to Resolve)
* Incident SLA adherence
* Reduction in repeat incidents
* Quality of RCA documentation
* Proactive monitoring improvements



Key Skills

  • History
  • Insurance Management
  • JDE
  • Administration Office
  • Catering Operations