SRE/MLOps Engineer (Python)

Hyderabad - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

N-iX is a global software development service company that helps businesses across the world develop successful software products. Founded in 2002, N-iX has come a long way, expanding its presence across Europe, the US, and Latin America. Today we are a strong community of 2000 professionals and a reliable partner for global industry leaders and Fortune 500 companies.

Our client is a global commerce leader where you can influence how the world buys, sells, and gives. You'll be part of a work culture that's been genuinely committed to diversity and inclusion since its founding over twenty-five years ago. Here you can be yourself, do your best work along with a team of professionals, and have a meaningful impact on people across the globe. We seek people with drive, ideas, and a passion for helping small businesses succeed.

We are seeking a highly motivated, experienced SRE/MLOps engineer with Python and Ray experience to build and maintain the next-generation AI platform. This role focuses on developing software on top of open-source libraries such as Ray, enabling internal teams to run ML workloads efficiently.

Responsibilities:

Build, refactor, and release software for the AI platform (feature development and bug fixes)
Deploy and manage applications on Ray, including workload management, cluster deployment, distributed task scheduling, and troubleshooting
Use Ray Dashboard and CLI tools to monitor and debug distributed jobs
Work with Ray ecosystem libraries: Ray Train, Ray Tune, Ray Serve, Ray Data
Integrate with tools such as Airflow, MLflow, Dask, DeepSpeed (a plus)
Collaborate with AI platform developers to provide CI/CD pipelines for automated deployment and configuration
Ensure high availability (target 99.999%) and monitor production systems
Develop automation for problem management and operational efficiency
Write documentation and provide technical support for internal users
Follow best practices for development: versioning, source control, branching and merging patterns

Requirements:

Main coding language: Python (C good to have)
Strong experience with Ray, including at least two areas such as Ray Train or Ray Serve
Kubernetes / Docker: Proficient / Experienced
Hands-on experience with distributed systems, cluster management, and cloud technologies
Familiarity with DevOps practices, CI/CD pipelines, and test automation
Excellent problem-solving, debugging, and triaging skills
Strong communication skills for collaboration with partners, customers, and engineers
Ability to manage multiple projects in a fast-paced environment
TensorRT, DeepSpeed, PyTorch Distributed will be a plus
English proficiency (oral and written)

Role specifics:

Infra vs. coding requirements: 30% infrastructure (can be learned with guidance), 70% coding (essential for features and bug fixes)
The role targets engineers rather than data scientists: focus on deployment, abstractions, monitoring, and alerting of Ray applications at scale
Ray proficiency is critical; the second version of the platform will be built on Ray
Understanding Ray Serve for real-time serving and Ray Train for model training is required

We offer*:

Flexible working format - remote, office-based, or flexible
A competitive salary and good compensation package
Personalized career growth
Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
Active tech communities with regular knowledge sharing
Education reimbursement
Memorable anniversary presents
Corporate events and team buildings
Other location-specific benefits

*not applicable for freelancers
