Network Reliability Engineer

Margo


Job Location:

Warsaw - Poland

Monthly Salary: Not Disclosed
Posted on: Yesterday
Vacancies: 1 Vacancy

Job Summary

#HPC #AI #GPU #CLUSTERS
YOUR DAILY ROUTINE
- Build a large AI infrastructure with monitoring, diagnosis, and remediation of production incidents
- Troubleshoot high-impact production issues in collaboration with other engineering teams
- Participate in an on-call rotation to handle incidents and ensure service continuity
- Implement and maintain observability solutions to monitor AI infrastructure and application health
- Contribute to AI infrastructure lifecycle management across different environments and countries
- Promote and apply best practices for stability, resiliency, scalability, and security
- Maintain clear technical documentation for tools and procedures
- Contribute to system and tool evolution based on production feedback
- Collaborate closely with development teams to ensure infrastructure readiness
- Participate in team rituals and knowledge-sharing initiatives
ABOUT YOU
SOFT SKILLS:
- Proactive and solution-oriented mindset
- Passion for automation and continuous improvement
- Strong collaboration and communication skills
- Ability to work independently and in a team
- Willingness to mentor and share knowledge
HARD SKILLS:
- Experience with Go or Python
- Strong scripting skills (Bash, Python)
- Hands-on experience with Linux systems (Ubuntu/Debian)
- Hands-on experience with GPU and HPC infrastructure (preferred)
- Knowledge of networking (VLAN/LAN, TCP/IP, DNS, BGP, load balancing, IPv6, etc.)
- Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.)
- Comfortable with Infrastructure-as-Code (Ansible, Salt, AWX, etc.)
- Experience managing relational databases (MariaDB)
- Understanding of CI/CD pipelines (GitLab)
- Comfortable with English (written and spoken)
200 zł - 250 zł an hour
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

Required Experience:

IC


About Company


We have chosen to take on only highly complex IT engagements for our clients. We are convinced that the complexity of an engagement is correlated with its added value. MARGO is thus a partner to ambitious, visionary clients who bet on PROGRESS. ...
