Staff Engineer, Engineering Compute Infrastructure and Grid Operations
Austin, TX - USA
Job Summary
About Marvell
Marvell's semiconductor solutions are the essential building blocks of the data infrastructure that connects our world. Across enterprise, cloud and AI, and carrier architectures, our innovative technology is enabling new possibilities.
At Marvell, you can affect the arc of individual lives, lift the trajectory of entire industries, and fuel the transformative potential of tomorrow. For those looking to make their mark on purposeful and enduring innovation, above and beyond fleeting trends, Marvell is a place to thrive, learn, and lead.
Your Team, Your Impact
What You Can Expect
We are seeking a Senior Engineer to design, operate, and continuously improve the engineering compute infrastructure used for large-scale chip design and verification. This role is heavily focused on grid job management, storage systems, reliability, and operational excellence in high-throughput compute environments.
The ideal candidate has strong IT and systems skills, deep experience with batch schedulers and distributed storage, and a passion for diagnosing and preventing large-scale job failures that impact engineering productivity.
Key Responsibilities: Grid & Job Management
Own and evolve grid job management infrastructure used for large regressions and high-volume batch workloads.
Debug and resolve grid job failures, including scheduling issues, hung jobs, resource starvation, and intermittent infrastructure faults.
Improve job reliability through watchdogs, retries, heartbeats, timeouts, and failure detection mechanisms.
Work with job controllers and wrapper layers to ensure consistent behavior across grid environments (e.g., LSF, UGE).
Partner with IT and compute teams during grid migrations, upgrades, and expansions.
Key Responsibilities: Storage & Filesystem Infrastructure
Develop deep operational understanding of shared engineering storage systems used by compute jobs.
Diagnose and resolve issues related to I/O performance, file contention, permissions, and cross-mounted filesystems.
Identify and mitigate storage-related failure modes that cause job instability or data corruption.
Collaborate with IT teams on filesystem migrations, maintenance windows, and outage prevention.
Key Responsibilities: Reliability, Monitoring & Prevention
Proactively identify systemic issues that lead to grid instability or job loss.
Design and deploy monitoring, logging, and metrics to detect infrastructure problems early.
Perform root-cause analysis of complex, intermittent failures affecting compute, storage, or networking.
Define best practices and guardrails to prevent repeat incidents and improve overall system robustness.
Key Responsibilities: Cross-Team Collaboration
Act as a technical bridge between engineering users, tools teams, and central IT.
Translate engineering workload requirements into actionable infrastructure improvements.
Communicate clearly during incidents, maintenance events, and post-mortems.
Document operational procedures and share knowledge to reduce support burden.
What We're Looking For
Qualifications and Skills
Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
8 years of experience in compute infrastructure, grid operations, or large-scale engineering environments.
Strong experience with grid or batch schedulers (e.g., LSF, UGE, Slurm, PBS).
Hands-on experience debugging distributed systems and batch job failures.
Strong Linux systems knowledge, including process management and resource monitoring.
Experience with shared storage systems (NFS, enterprise filers, high-performance filesystems).
Strong scripting skills in Python, shell, or similar languages.
Preferred Qualifications
Experience supporting EDA or engineering compute workloads.
Familiarity with job controller or wrapper-based execution architectures.
Experience operating environments with thousands of concurrent batch jobs.
Exposure to cloud or hybrid compute environments.
Prior involvement in grid or filesystem migrations.
Strong incident response and post-mortem leadership skills.
Expected Base Pay Range (USD)
$128,000 - $189,370 per annum
The successful candidate's starting base pay will be determined based on job-related skills, experience, qualifications, work location, and market conditions. The expected base pay range for this role may be modified based on market conditions.
Additional Compensation and Benefit Elements
Marvell is committed to providing exceptional, comprehensive benefits that support our employees at every stage - from internship to retirement and through life's most important moments. Our offerings are built around four key pillars: financial well-being, family support, mental and physical health, and recognition. Highlights include an employee stock purchase plan with a 2-year look back, family support programs to help balance work and home life, robust mental health resources to prioritize emotional well-being, and recognition and service awards to celebrate contributions and milestones. We look forward to sharing more with you during the interview process.
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, or protected veteran status.
Any applicant who requires a reasonable accommodation during the selection process should contact Marvell HR Helpdesk at .
Interview Integrity
To support fair and authentic hiring practices, candidates are not permitted to use AI tools (such as transcription apps, real-time answer generators like ChatGPT or Copilot, or automated note-taking bots) during interviews.
These tools must not be used to record, assist with, or enhance responses in any way. Our interviews are designed to evaluate your individual experience, thought process, and communication skills in real time. Use of AI tools without prior instruction from the interviewer will result in disqualification from the hiring process.
This position may require access to technology and/or software subject to U.S. export control laws and regulations, including the Export Administration Regulations (EAR). As such, applicants must be eligible to access export-controlled information as defined under applicable law. Marvell may be required to obtain export licensing approval from the U.S. Department of Commerce and/or the U.S. Department of State. Except for U.S. citizens, lawful permanent residents, or protected individuals as defined by 8 U.S.C. 1324b(a)(3), all applicants may be subject to an export license review process prior to employment.
Required Experience:
Staff IC
About Company
Designed for your current needs and future ambitions, Marvell delivers the data infrastructure technology transforming tomorrow’s enterprise, cloud, automotive, and carrier architectures for the better.