We're looking for an engineer to lead the design, development, and scaling of our machine learning infrastructure. This role is ideal for someone who thrives at the intersection of systems engineering and applied machine learning. You'll be responsible for building robust, scalable, and maintainable infrastructure to support the full ML lifecycle, from data ingestion and feature computation to training, deployment, and monitoring. You will play a critical role in:
- Designing and maintaining high-throughput, low-latency pipelines for real-time and batch inference.
- Automating model training and evaluation workflows with reproducibility and traceability in mind.
- Defining infrastructure standards and best practices for ML experimentation, CI/CD, and observability.
- Collaborating with ML researchers and engineers to improve productivity through tooling and platform improvements.

You thrive in fast-paced, dynamic environments and are comfortable navigating ambiguity to deliver meaningful, incremental impact. You bring strong problem-solving skills, operate with a high degree of autonomy, and have a track record of executing effectively. With a commitment to continuous learning and attention to detail, you actively seek opportunities to innovate and share knowledge. You follow engineering best practices - including unit testing, CI/CD, documentation, monitoring, and alerting - to ensure reliable, maintainable solutions.
7 years of development experience and a Bachelor's or Master's degree in Computer Science, or 5 years of development experience and a PhD in Computer Science or a related field, with at least 3 years focused on large-scale machine learning infrastructure.
Proficient in Python with solid knowledge of software design principles.
Expertise in designing and implementing distributed systems or data pipelines (e.g., Spark, Flink, Kafka, Airflow), and knowledge of SQL to analyze data and derive insights.
Experience with ML lifecycle tools (e.g., MLflow, Metaflow, Kubeflow, SageMaker, Vertex AI).
Hands-on experience with container orchestration and cloud-native services (e.g., Kubernetes, Docker, AWS/GCP/Azure).
Leadership experience, including serving as a technical lead for complex cross-functional development projects and demonstrating sound technical judgment and prioritization skills. Strong communication skills and a proactive, ownership-driven mindset.
Prior experience architecting ML platforms or Feature Stores in a fast-paced production environment.
Experience with real-time model serving and streaming pipelines (e.g., Kafka, Flink, Ray Serve, Triton).
Experience optimizing GPU and CPU resource allocation for training and inference workloads.
Experience with any ML authoring framework (PyTorch, TensorFlow, JAX, etc.), particularly on-device ML frameworks such as CoreML, TFLite, or ExecuTorch.