Site Reliability Engineer - High Performance Computing AI-ML

Twitter

Posted on : 11-04-2025

Employer Active

1 Vacancy

Job Location

Austin - USA

Base Salary Range

$120,000 - $297,000

Job Description

Role: Site Reliability Engineer, HPC / AI/ML (All Levels)
Location: Palo Alto, New York, Seattle, or Austin
Base Salary Range: $120,000 to $297,000, plus equity

Who We Are:

At X, we're pioneering the frontier of technology with our innovative Everything App. Our mission is to revolutionize how people connect, share ideas, and engage in meaningful conversations. We champion freedom of speech and strive to create a platform that embraces diverse perspectives. Our commitment is to foster open dialogue and empower individuals to express themselves freely.

What You'll Do:

As a Site Reliability Engineer (SRE) supporting HPC (High Performance Computing) AI/ML initiatives at X, you will play a crucial role in maintaining and enhancing the reliability, availability, and performance of our large-scale systems. Your responsibilities will include:

Managing and troubleshooting large-scale clusters to ensure the stability and efficiency of our platform (primarily Linux and Kubernetes)
Collaborating with cross-functional teams, including hardware engineers and software developers, to support and improve our infrastructure
Automating the provisioning and deployment of systems to enhance long-term health and scalability
Ensuring the robustness of our HPC environments and storage clusters
Writing and maintaining scripts and tools for automation and monitoring
Addressing system failures and performance issues, identifying root causes, and implementing preventive measures
Working closely with end users to understand changing needs as our environment evolves

Who You Are:

We're looking for exceptional engineers who are passionate about our mission and have a strong desire to make a meaningful impact. The ideal candidate will have:

2 years of professional software development experience
Extensive experience with Kubernetes and container orchestration
Proficiency in one or more object-oriented programming languages (e.g., Python, Java, C, Scala)
Proficiency in scripting languages (Python, Bash, etc.)
Strong experience in configuration management (e.g., Puppet, Ansible, Chef)
Familiarity with Ethernet networking at scale and distributed systems
Strong troubleshooting skills and experience with HPC environments
Experience managing large-scale systems, ideally supporting thousands of machines
Working understanding of the storage systems required to support such environments
Experience with various GPU / accelerator architectures and ability to optimize performance on such platforms.
Ability to think outside the box and come up with innovative solutions to complicated problems.
Extremely committed and willing to work in a fast-paced environment
Excellent communication and interpersonal skills

At X, our small but fast-paced team values innovation, creativity, and a strong commitment to our mission. As a Site Reliability Engineer, you'll have the opportunity to make a significant impact on the future of X and our aspiration to build the Everything App.

Employment Type

Full-Time
