About the Role:
We are seeking a highly motivated Production Support Engineer with 2 years of experience to ensure the continuous and efficient operation of our production systems. In this role, you will be responsible for monitoring, troubleshooting, and resolving production issues in real time, as well as improving the overall stability and performance of our services.
You will work closely with development, QA, and operations teams to address incidents, identify root causes, and implement long-term solutions. If you thrive in high-pressure environments and enjoy problem-solving, this could be a perfect fit for you.
Key Responsibilities:
- Monitor the health, performance, and availability of production systems and services
- Diagnose and resolve production issues quickly, minimizing downtime and impact on end users
- Provide on-call support for production incidents and manage issue escalation as necessary
- Collaborate with development teams to investigate root causes of production issues and propose solutions
- Perform system health checks and regular system maintenance tasks to ensure optimal performance
- Implement monitoring tools and alerting systems to proactively identify potential issues before they impact users
- Deploy bug fixes, patches, and system upgrades in production environments
- Document issues, resolution steps, and operational procedures for knowledge sharing
- Assist in post-incident reviews and implement improvements based on lessons learned
- Help implement change management processes to ensure smooth and controlled deployments
- Ensure adherence to SLAs (Service Level Agreements) for incident resolution and response time
Qualifications:
Required:
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field
- 2 years of experience in production support or operations management in a tech environment
- Familiarity with Linux/Unix or Windows server administration
- Strong experience with monitoring and alerting tools (e.g. Prometheus, Grafana, Nagios, New Relic)
- Ability to work with log aggregation and analysis tools (e.g. ELK Stack, Splunk)
- Proficiency in troubleshooting application, infrastructure, and network issues
- Experience with databases (e.g. MySQL, PostgreSQL, MongoDB)
- Knowledge of incident management tools (e.g. JIRA, ServiceNow)
- Strong understanding of cloud platforms (e.g. AWS, Azure, GCP) and cloud infrastructure
- Familiarity with CI/CD pipelines and deployment automation tools
Preferred:
- Experience in automation and scripting (e.g. Bash, Python, shell scripting)
- Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes
- Experience in load balancing, scaling, and disaster recovery practices
- Knowledge of ITIL or other IT operations frameworks
- Experience in release management and deployment strategies