Software Engineer, AI Infra Visibility

Palo Alto, CA - USA

Base Salary: $140,000 - $210,000

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

About Clockwork Systems

Software-Driven Fabrics to Increase GPU Cluster Utilization

Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI fabrics by delivering cross-stack observability to catch and quickly resolve problems, workload fault tolerance to keep jobs running through failures, and performance acceleration that dynamically routes and paces traffic to avoid congestion.

About The Role

We are looking for a strong Software Engineer to help design, build, and scale backend systems for AI and GPU cluster observability. In this role, you will work on high-performance distributed systems that power telemetry ingestion, data processing, and APIs for monitoring large-scale GPU clusters and AI workloads.

What You'll Do

Design and build scalable backend systems for metric collection, processing, and analysis.
Develop robust methods to detect complex infrastructure issues that impact AI workloads.
Build large distributed systems running in production environments.
Collaborate across teams to deliver reliable, performant, and maintainable systems.

What We Are Looking For

2 years of industry experience building and operating production software systems.
Strong foundation in data structures, algorithms, and software design.
Fluency in one or more programming languages: C, C++, Go, Java, or Python.
Solid understanding of operating systems fundamentals (threads, scheduling, synchronization; kernel programming is a plus).
Experience with databases, including design, development, or scaling.
Excellent debugging, problem-solving, and communication skills.

Nice to Have

Knowledge of networking protocols; familiarity with NIC architecture and operation.
Understanding of GPU or AI infrastructure (e.g., DCGM, PyTorch).
Familiarity with observability systems (metrics, logs, traces); experience with OpenTelemetry, Prometheus, or distributed tracing is a bonus.
Experience designing, building, and scaling large distributed systems.
Hands-on experience with service-oriented architectures and cloud platforms (AWS, GCP, Azure).

Enjoy

Challenging projects.
A friendly and inclusive workplace culture.
Competitive compensation.
A great benefits package.
Catered lunch.

Compensation for this position will vary based on the skills and experience you bring, as well as internal equity considerations. For candidates hired at the posted level, the expected base salary range is $140,000 - $210,000. The offered compensation package may also include stock options or other equity awards, subject to Clockwork's equity program and applicable approvals.

Clockwork Systems is an equal opportunity employer. We are committed to building world-class teams by welcoming bright, passionate individuals from all backgrounds. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity or expression, national origin, disability, or protected veteran status. We believe diversity drives innovation, and we grow stronger together.
