About Clockwork Systems
Software-Driven Fabrics to Increase GPU Cluster Utilization
Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI fabrics by delivering cross-stack observability to catch and quickly resolve problems, workload fault tolerance to keep jobs running through failures, and performance acceleration that dynamically routes and paces traffic to avoid congestion.
To learn more visit .
About The Role
We are looking for a strong Software Engineer to help design, build, and scale backend systems for AI and GPU cluster observability. In this role, you will work on high-performance distributed systems that power telemetry ingestion, data processing, and APIs for monitoring large-scale GPU clusters and AI workloads.
What You'll Do
- Design and build scalable backend systems for metric collection, processing, and analysis.
- Develop robust methods to detect complex infrastructure issues that impact AI workloads.
- Build large distributed systems running in production environments.
- Collaborate across teams to deliver reliable, performant, and maintainable systems.
What We Are Looking For
- 2 years of industry experience building and operating production software systems.
- Strong foundation in data structures, algorithms, and software design.
- Fluency in one or more programming languages: C, C++, Go, Java, or Python.
- Solid understanding of operating systems fundamentals (threads, scheduling, synchronization; kernel programming is a plus).
- Experience with databases, including design, development, or scaling.
- Excellent debugging, problem-solving, and communication skills.
Nice to Have
- Knowledge of networking protocols; familiarity with NIC architecture and operation.
- Understanding of GPU or AI infrastructure (e.g., DCGM, PyTorch).
- Familiarity with observability systems (metrics, logs, traces); experience with OpenTelemetry, Prometheus, or distributed tracing is a bonus.
- Experience designing, building, and scaling large distributed systems.
- Hands-on experience with service-oriented architectures and cloud platforms (AWS, GCP, Azure).
Enjoy
- Challenging projects.
- A friendly and inclusive workplace culture.
- Competitive compensation.
- A great benefits package.
- Catered lunch.
Clockwork Systems is an equal opportunity employer. We are committed to building world-class teams by welcoming bright, passionate individuals from all backgrounds. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity or expression, national origin, disability, or protected veteran status. We believe diversity drives innovation, and we grow stronger together.
Required Experience:
IC