We are seeking a highly skilled and experienced AI Systems Engineer to join our team. This is a hands-on, senior individual contributor role that will be pivotal in leading the development, operations, and support of our entire AI infrastructure. You will be responsible for the entire lifecycle of our AI systems, from architecting and building high-performance GPU clusters to deploying and optimizing our most advanced AI models and agentic services.
Responsibilities
AI Infrastructure Architecture & Strategy: Lead the design and implementation of our next-generation AI infrastructure to support our Agentic AI initiatives. You will define the technical strategy for our on-premises GPU clusters, storage solutions, and networking to ensure optimal performance, scalability, and reliability for all our AI workloads.
Cloud AI Service Integration: Support and secure the use of public cloud AI services, including Azure OpenAI services and Google Cloud Platform (GCP) services like Gemini. This includes managing secure access, monitoring usage, and tracking billing to ensure cost-effectiveness. You will also have hands-on experience supporting compute GPUs and AI services on both GCP and Azure.
Hands-on GPU Cluster Management: Take a leadership role in the configuration, installation, and optimization of GPU server clusters. This includes advanced troubleshooting of hardware and software, performance tuning, and implementing best practices for cluster utilization and resource management. You will be an expert in administering job schedulers like LSF in a production environment, including integration with Docker for containerized job submission.
Full-Stack AI Tech Stack Development & Operations: Architect and deploy a robust and scalable AI tech stack. You will be responsible for the end-to-end operational lifecycle, including setting up and managing deep learning frameworks (PyTorch, TensorFlow), containerization with Docker and Kubernetes, and implementing CI/CD pipelines for AI model development.
Advanced LLM Deployment & Optimization: Lead the deployment, serving, and optimization of Large Language Models (LLMs). You will be an expert in techniques such as model quantization and distillation, and in using high-performance serving frameworks (e.g., vLLM, TGI, TensorRT-LLM) to maximize inference throughput and minimize latency.
Agentic AI Workflow & Service Engineering: Architect and build production-grade Agentic AI workflows and services. You will be responsible for the technical design and implementation of systems that integrate LLMs with external tools, APIs, and databases, and will mentor other engineers on building robust and scalable AI agent applications.
Automation & Monitoring: Develop and maintain automation scripts using languages like Python, Bash, or Perl to streamline system maintenance, deployment, and reporting. Implement and manage monitoring solutions for system health, job statuses, GPU utilization, and container performance to proactively identify and resolve issues.
AI Systems Support & Mentorship: Act as the final escalation point for the most complex technical issues related to our AI infrastructure. You will also serve as a technical leader and mentor to other engineers, providing guidance on best practices in AI systems engineering, performance tuning, and operational excellence.
Security and Compliance: Develop and implement security best practices for our AI systems and data, ensuring compliance with relevant regulations and protecting our intellectual property.
Required Skills and Qualifications
10 years of experience in a senior technical role, with at least 5 years focused on building and operating high-performance computing or AI infrastructure. Proven track record as a Principal or Senior Staff Engineer.
Expert-level knowledge of NVIDIA GPU architecture and technologies like CUDA and cuDNN. Extensive experience with multi-GPU and multi-node training and inference.
Proven experience with public cloud AI services, specifically managing access, usage, and billing for Azure OpenAI and Google Cloud Platform (GCP) services.
Extensive hands-on experience with Docker: image management, container orchestration, and troubleshooting.
Proficiency in scripting languages such as Python, Bash, or Perl.
Deep expertise in Linux system administration (RHEL preferred), including networking, storage, and performance tuning.
Familiarity with user authentication and integration using systems like LDAP or Active Directory.
Strong problem-solving and communication skills, with the ability to work in a multi-platform, cross-functional, and geographically distributed team.
Preferred/Bonus Skills
Understanding of AI job profiling and tuning (memory, GPU, I/O).
Experience administering LSF clusters in a production or research environment. Familiarity with other job schedulers like Slurm is a plus.
Experience with LSF Docker integration and job submission using container images.
Experience with macOS/Apple Silicon system administration tasks and troubleshooting.
The annual salary range for California is $136,500 to $253,500. You may also be eligible to receive incentive compensation: bonus, equity, and benefits. Sales positions generally offer a competitive On Target Earnings (OTE) incentive compensation structure. Please note that the salary range is a guideline, and compensation may vary based on factors such as qualifications, skill level, competencies, and work location. Our benefits programs include: paid vacation and paid holidays, 401(k) plan with employer match, employee stock purchase plan, a variety of medical, dental, and vision plan options, and more.
Required Experience:
Staff IC
Do you want to shape the future of technology? Cadence is leading the charge to solve some of technology’s toughest challenges. We work with the world’s most innovative companies, across a growing range of industries. Major trends that you hear about every day – like artificial intell ...