Senior Site Reliability Engineer YT Platform

Nebius

Posted on : 06-03-2024

1 Vacancy

Job Location

Belgrade - Serbia

Monthly Salary

Not Disclosed

Job Description

Req ID : 2595455

The company

Nebius AI is an AI-centric public cloud platform built specifically for AI model training and inference.

Our mission is to help ML practitioners concentrate on their core jobs while we handle DevOps, MLOps, and infrastructure-related tasks. The idea is to build an ML-specific cloud platform covering the entire ML lifecycle from A to Z: from data preparation and labeling to ML training and inference.

We recognize the potential of ML and AI technologies and aim to provide our future users with the perfect environment to train and fine-tune their models. We are committed to delivering the best user experience and excellent customer support.

Four development hubs:
Nebius is headquartered in the Netherlands, with hubs in Finland, Serbia, and Israel.

Data center in Europe:
Our own data center in Finland features server racks designed in-house for ML-specific high load, with power-efficient solutions including a free-cooling system.

500 professionals:
Our mature team of engineers has a proven track record in developing sophisticated cloud and ML solutions and designing cutting-edge hardware.

The role

We are building the Nebius YTsaurus platform on top of YTsaurus, a mature open-source big data platform. It includes distributed storage capable of holding exabytes of data, a scalable MapReduce framework, its own OLTP key-value storage, and a message queue. It also uses several high-level computational engines such as YQL, CHYT (powered by ClickHouse), and SPYT (powered by Apache Spark). YTsaurus is now gaining popularity as an instrument for AI and ML workloads, serving as a solid foundation for dataset management and distributed machine learning operations. We are working to make Nebius YTsaurus a successful cloud product and a next-generation managed AI solution.

In this role, your tasks will include improving the current monitoring and alerting ecosystem and helping build a k8s operator for hardware node maintenance. We have a big project coming up around our Kubernetes ecosystem, where there will be a lot of new things to do!

In this position, your responsibilities will be to:

Advise: Collaborate closely with engineering teams to design and develop highly resilient and performant systems at scale.
Diagnose: Use your understanding of distributed systems to quickly identify and resolve low-level challenges. Dive into complex issues such as networking, load balancing, and hardware maintenance, and demonstrate your troubleshooting and problem-solving skills.
Automate: Identify bottlenecks and repetitive patterns in existing support processes and reduce their costs by introducing better automation.
Support: Participate in existing YTsaurus cluster operations, including tasks such as deploying various microservices, monitoring cluster health, investigating issues, and resolving bugs. This commitment extends to providing on-call support.
Contribute to open source: Contribute to the YTsaurus repository, including the YT k8s operator, and collaborate within the YTsaurus developer community.

We expect you to have:

5 years of experience in software engineering, site reliability engineering, or a development-focused DevOps role.
Experience with high-load/distributed systems, microservice architecture, and DBMS.
Proficiency in Python or Go.
Strong working knowledge of Linux and containers, Bash, and administration tools such as tcpdump, strace, iotop, and iperf.
A comprehensive knowledge of Kubernetes.
Experience with IaC tools like Terraform/SaltStack/Ansible.
Readiness to occasionally read code in C/Java for reference and better understanding of YTsaurus internals.
Knowledge of standard algorithms and data structures.
Good communication and collaboration skills.

Why YTsaurus

Challenging projects and issues: Providing a reliable infrastructure for external and internal customers requires the development of a number of automation and operational tools around the largest k8s-based YTsaurus clusters.
Modern technology stack: Our cutting-edge technology stack places YTsaurus clusters at the core, seamlessly deployed within Kubernetes. These clusters are orchestrated and managed by an open-source YT k8s operator.
Collaborative excellence: Work with highly skilled developers who have many years of experience in high-load distributed systems, fostering a culture of innovation and excellence.

Does all that sound like your kind of challenge? Then join us!

Employment Type

Full Time

Key Skills

Kubernetes

FMEA

Continuous Improvement

Mechanical Engineering

Manufacturing

Troubleshooting

About Company

Nebius
