N-iX is a global software development service company that helps businesses across the world develop successful software products. Founded in 2002, N-iX has come a long way, expanding its presence across Europe, the US, and Latin America. Today we are a strong community of 2,000 professionals and a reliable partner for global industry leaders and Fortune 500 companies.
Our client is a global commerce leader where you can influence how the world buys, sells, and gives. You'll be part of a work culture that's been genuinely committed to diversity and inclusion since its founding over twenty-five years ago. Here you can be yourself, do your best work along with a team of professionals, and have a meaningful impact on people across the globe. We seek people with drive, ideas, and a passion for helping small businesses succeed.
We are seeking a highly motivated, experienced SRE/MLOps engineer with Python experience to build and maintain the next-generation AI platform. This role focuses on developing software on top of open-source libraries such as Ray, enabling internal teams to run ML workloads efficiently.
Responsibilities:
- Build, refactor, and release software for the AI platform (feature development and bug fixes)
- Deploy and manage applications, including workload management, cluster deployment, distributed task scheduling, and troubleshooting
- Use Ray Dashboard and CLI tools to monitor and debug distributed jobs
- Work with Ray ecosystem libraries: Ray Train, Ray Tune, Ray Serve, Ray Data
- Integrate with tools such as Airflow, MLflow, Dask, DeepSpeed (a plus)
- Collaborate with AI platform developers to provide CI/CD pipelines for automated deployment and configuration
- Ensure high availability (target 99.999%) and monitor production systems
- Develop automation for problem management and operational efficiency
- Write documentation and provide technical support for internal users
- Follow best practices for development: versioning, source control, branching and merging patterns
Requirements:
- Main coding language: Python (C good to have)
- Strong experience with Ray, including at least two areas such as Ray Train or Ray Serve
- Kubernetes / Docker: Proficient / Experienced
- Hands-on experience with distributed systems, cluster management, and cloud technologies
- Familiarity with DevOps practices, CI/CD pipelines, and test automation
- Excellent problem-solving, debugging, and triaging skills
- Strong communication skills for collaboration with partners, customers, and engineers
- Ability to manage multiple projects in a fast-paced environment
- TensorRT, DeepSpeed, PyTorch Distributed will be a plus
- English proficiency (oral and written)
Role specifics:
- Infra vs. coding requirements: 30% infrastructure (can be learned with guidance), 70% coding (essential for features and bug fixes)
- The role targets engineers rather than data scientists: focus on deployment abstractions, monitoring, and alerting of Ray applications at scale
- Ray proficiency is critical; the second version of the platform will be built on Ray
- Understanding Ray Serve for real-time serving and Ray Train for model training is required
We offer*:
- Flexible working format: remote, office-based, or flexible
- A competitive salary and good compensation package
- Personalized career growth
- Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
- Active tech communities with regular knowledge sharing
- Education reimbursement
- Memorable anniversary presents
- Corporate events and team buildings
- Other location-specific benefits
*not applicable for freelancers