We are looking for a Machine Learning Operations Engineer to join our team to design, build, and integrate MLOps for large-scale distributed machine learning systems, focusing on cutting-edge tools, distributed GPU training, and enhancing research experimentation.
About SatSure:
SatSure is a deep tech, decision intelligence company that works primarily at the nexus of agriculture, infrastructure, and climate action, creating an impact for the other millions, focusing on the developing world. We want to make insights from earth observation data accessible to all.
Join us to be at the forefront of building a deep tech company from India that solves problems for the globe.
Roles & Responsibilities:
- Architect, build, and integrate end-to-end life cycles of large-scale distributed machine learning systems (i.e., MLOps) using cutting-edge tools and frameworks.
- Develop tools and services for explainability of ML solutions.
- Implement distributed cloud GPU training approaches for deep learning models.
- Build software and tools that improve the rate of experimentation for the research team and extract insights from those experiments.
- Identify and evaluate new patterns and technologies to improve the performance, maintainability, and elegance of our machine learning systems.
- Contribute to and execute technical projects to completion. Collaborate with peers to develop requirements and monitor progress.
- Collaborate with engineers across various functions to solve complex data problems at scale.
Qualifications:
- 5-8 years of professional experience implementing MLOps frameworks to scale up ML in production.
- Master's degree or PhD in Computer Science, Machine Learning / Deep Learning domains.
Must-haves:
- Hands-on experience with orchestration and pipeline tools like Kubernetes, Apache Airflow, etc., and ML lifecycle management tools such as MLflow, SageMaker, or similar, covering model training, inference, evaluation, and deployment.
- Proficient in deploying ML models using frameworks like Ray Serve, TorchServe, TensorFlow Serving, or NVIDIA Triton Inference Server.
- Strong foundation in ML model training frameworks such as PyTorch, PyTorch Lightning, TensorFlow, etc.
- Experience leveraging GPU computing for parallel processing of data and model training.
- Solid software engineering skills with a track record of building production-grade systems.
- Advanced programming skills in Python.
- Proven experience in designing and implementing end-to-end data systems in roles like ML Engineer, ML Platform Engineer, or similar.
- Familiarity with cloud-based data processing tools and services such as AWS (S3, ECR, Lambda), Spark, Dask, Elasticsearch, Presto, and SQL.
- Exposure to geospatial or remote sensing data is an added advantage.
Core Competencies:
- Strong debugging and critical thinking capabilities.
- Excellent analytical and problem-solving skills.
- Ability to thrive in fast-paced, collaborative team environments.
Benefits:
- Medical health cover for you and your family, including unlimited online doctor consultations
- Access to mental health experts for you and your family
- Dedicated allowances for learning and skill development
- Comprehensive leave policy with casual leaves, paid leaves, marriage leaves, and bereavement leaves
- Twice-a-year appraisal
Interview Process:
- Intro call
- Assessment
- Presentation
- Interview rounds (ideally up to 3-4 rounds)
- Culture Round / HR round