If you already have a Candidate Login but have forgotten your password, please use the steps to reset your password. If you have forgotten your email login, please contact us with the subject "Workday Candidate Login".
When creating your Workday account and entering personal information such as your name and address, please do not use ALL CAPS.
Thank you!
NOTICE: For EMEA jobs, please review the Privacy Policy.
Job Responsibilities:
We are seeking a Senior AI/ML Scalability Engineer to design and optimize platforms capable of supporting the significant growth of AI/ML workloads. This role is focused on ensuring the scalability, reliability, and efficiency of AI/ML infrastructure while contributing to the development of robust, high-performance systems. The ideal candidate will collaborate with cross-functional teams to build resilient infrastructure and implement solutions that ensure seamless model deployment, monitoring, and lifecycle management at scale.
Key Responsibilities:
Platform Scalability: Design and implement scalable solutions for AI/ML infrastructure, enabling horizontal scaling, efficient resource utilization, and fault tolerance under high-demand scenarios.
Stability & Reliability: Apply best practices for platform stability, high availability, and disaster recovery, ensuring uninterrupted operations during peak workloads.
Observability & Monitoring: Build and maintain advanced observability frameworks, including monitoring, logging, and tracing solutions, leveraging tools like Datadog.
Automation & Efficiency: Develop automation pipelines for infrastructure provisioning, deployment, and operational workflows to minimize manual intervention and maximize efficiency.
Cross-Functional Collaboration: Work closely with data science, product, and engineering teams to align infrastructure capabilities with organizational goals and ensure seamless model deployment, testing, and lifecycle management.
Cost Optimization: Implement strategies to optimize cloud resource usage and manage platform costs effectively while maintaining performance and reliability.
Incident Response: Participate in incident response efforts, including post-mortems and root cause analyses, to improve platform resilience and prevent recurring issues.
Continuous Improvement: Stay current with industry trends in cloud infrastructure, distributed systems, and observability, applying innovative solutions to enhance platform scalability and performance.
Education:
Educational Background: Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Experience:
5 years of experience in AI/ML platform engineering, infrastructure, or operations.
Proven track record of designing, scaling, and maintaining large distributed systems, with a focus on scalability, stability, and performance.
Technical Expertise:
Expertise in cloud infrastructure (AWS, GCP, Azure) and infrastructure-as-code tools (Terraform, CloudFormation, etc.).
Strong programming skills in Python, with experience building scalable, maintainable systems.
Deep understanding of observability practices, including distributed tracing, log aggregation, and real-time monitoring.
Other Qualifications:
Scalability & Reliability: Proven ability to design scalable architectures and implement solutions for automated failover and disaster recovery. Experience in optimizing performance and resource utilization for high-demand environments.
Communication & Collaboration: Strong communication skills, with the ability to articulate technical concepts to both technical and non-technical stakeholders. Ability to collaborate effectively with cross-functional teams to deliver integrated solutions.
Problem-Solving Skills: Excellent problem-solving skills and the ability to address complex technical challenges in a fast-paced environment.
Cost Optimization: Experience with cost management strategies for cloud-based platforms, with a focus on balancing performance and cost.
Additional Job Details:
Required Experience:
Senior IC
Welocalize enables brands to reach and grow global audiences through services and solutions for translation, localization, adaptation, interpretation, and automation. We offer multilingual solutions to transform all content types for local audiences, at every step of our clients’ glob ...