At eBay, we're more than a global ecommerce leader; we're changing the way the world shops and sells. Our platform empowers millions of buyers and sellers in more than 190 markets around the world. We're committed to pushing boundaries and leaving our mark as we reinvent the future of ecommerce for enthusiasts.
Our customers are our compass, authenticity thrives, bold ideas are welcome, and everyone can bring their unique selves to work every day. We're in this together, sustaining the future of our customers, our company, and our planet.
Join a team of passionate thinkers, innovators, and dreamers, and help us connect people and build communities to create economic opportunity for all.
At eBay, we are building the next-generation AI platform to power intelligent experiences for millions of users worldwide. Our AI Platform (AIP) provides the scalable, secure, and efficient foundation for deploying and optimizing advanced machine learning and large language model (LLM) workloads at production scale. We enable teams across eBay to move from experimentation to global deployment with speed, reliability, and efficiency.
We are seeking an experienced Machine Learning Platform Support Engineer to join our AI Platform team. In this role, you will be the first line of support (L1) for ML workloads running on Ray and Kubernetes clusters. You will be responsible for triaging, monitoring, and resolving platform-related issues across ML training, inference, model deployment, and GPU resource allocation.
This position includes participation in on-call rotations (PagerDuty) and requires close collaboration with ML Platform engineers, researchers, and platform teams to ensure the reliability, scalability, and usability of the AI Platform. You will play a critical role in ensuring operational excellence and maintaining the uptime of the core infrastructure that powers eBay's global AI and ML systems.
Serve as the first point of contact (L1) for all support requests related to the AI/ML Platform, including ML training, inference, model deployment, and GPU allocation.
Provide operational and on-call (PagerDuty) support for Ray and Kubernetes clusters running distributed ML workloads across cloud and on-prem environments.
Monitor, triage, and resolve platform incidents involving job failures, scaling errors, cluster instability, or GPU resource contention.
Manage GPU quota allocation and scheduling across multiple user teams, ensuring compliance with approved quotas and optimal resource utilization.
Support Ray Train/Tune for large-scale distributed training and Ray Serve for autoscaled inference, maintaining performance and service reliability.
Troubleshoot Kubernetes workloads, including pod scheduling, networking, image issues, and resource exhaustion in multi-tenant namespaces.
Collaborate with platform engineers, SREs, and ML practitioners to resolve infrastructure, orchestration, and dependency issues impacting ML workloads.
Improve observability, monitoring, and alerting for Ray and Kubernetes clusters using Prometheus, Grafana, and OpenTelemetry to enable proactive issue detection.
Maintain and enhance runbooks, automation scripts, and knowledge base documentation to accelerate incident resolution and reduce recurring support requests.
Participate in root cause analysis (RCA) and post-incident reviews, contributing to platform improvements and automation initiatives to minimize downtime.
Bachelor's or Master's degree in Computer Science, Engineering, or a related technical discipline (or equivalent experience).
5 years of experience in ML operations, DevOps, or platform support for distributed AI/ML systems.
Proven experience providing L1/L2 and on-call support for Ray and Kubernetes-based clusters supporting ML training and inference workloads.
Strong understanding of Ray cluster operations, including autoscaling, job scheduling, and workload orchestration across heterogeneous compute (CPU/GPU/accelerators).
Hands-on experience managing Kubernetes control plane and data plane components, multi-tenant namespaces, RBAC, ingress, and resource isolation.
Expertise in GPU scheduling, allocation, and monitoring (NVIDIA device plugin, MIG configuration, CUDA/NCCL optimization).
Proficiency in Python and/or Go for automation, diagnostics, and operational tooling in distributed environments.
Working knowledge of Kubernetes and cloud-native environments (AWS, GCP, Azure) and CI/CD pipelines.
Experience with observability stacks (Prometheus, Grafana, OpenTelemetry) and incident management tools (PagerDuty, ServiceNow).
Familiarity with ML frameworks such as TensorFlow and PyTorch, and their integration within distributed Ray/Kubernetes clusters.
Strong debugging, analytical, and communication skills to collaborate effectively with cross-functional engineering and research teams.
A customer-centric, operationally disciplined mindset focused on maintaining platform reliability, performance, and user satisfaction.
Please see the Talent Privacy Notice for information regarding how eBay handles your personal data collected when you use the eBay Careers website or apply for a job with eBay.
eBay is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, sex, sexual orientation, gender identity, veteran status, disability, or other legally protected status. If you have a need that requires accommodation, please contact us. We will make every effort to respond to your request for accommodation as soon as possible. View our accessibility statement to learn more about eBay's commitment to ensuring digital accessibility for people with disabilities.
Founded in 1995 in San Jose, Calif., eBay (NASDAQ: EBAY) is where the world goes to shop, sell and give. Whether you're buying new or used, common or luxurious, trendy or rare – if it exists in the world, it's probably for sale on eBay. Our great value and unique selection help every ...