We are looking for a Machine Learning Operations Engineer to join our team to design, build, and integrate MLOps for large-scale distributed machine learning systems, focusing on cutting-edge tools, distributed GPU training, and enhancing research experimentation.
About SatSure:
SatSure is a deep tech, decision intelligence company that works primarily at the nexus of agriculture, infrastructure, and climate action, creating an impact for the other millions, focusing on the developing world. We want to make insights from Earth observation data accessible to all.
Join us to be at the forefront of building a deep tech company from India that solves problems for the globe.
Roles & Responsibilities:
- Architect, build, and integrate end-to-end life cycles of large-scale distributed machine learning systems (i.e., MLOps) using cutting-edge tools/frameworks.
- Develop tools and services for the explainability of ML solutions.
- Implement distributed cloud GPU training approaches for deep learning models.
- Build software/tools that improve the rate of experimentation for the research team and extract insights from it.
- Identify and evaluate new patterns and technologies to improve the performance, maintainability, and elegance of our machine learning systems.
- Lead and execute technical projects to completion. Communicate with peers to build requirements and track progress.
- Mentor fellow engineers in your areas of expertise. Contribute to a team culture that values effective collaboration, technical excellence, and innovation.
- Collaborate with engineers across various functions to solve complex data problems at scale.
Qualifications:
- 5-8 years of professional experience implementing MLOps frameworks to scale up ML in production.
- Master's degree or PhD in Computer Science, Machine Learning / Deep Learning domains.
Must-have:
- Hands-on experience with Kubernetes, Kubeflow, MLflow, SageMaker, and other ML model experiment management tools, including training, inference, and evaluation.
- Experience in ML model serving (TorchServe, TensorFlow Serving, NVIDIA Triton Inference Server, etc.).
- Proficiency with ML model training frameworks (PyTorch, PyTorch Lightning, TensorFlow, etc.).
- Experience with GPU computing for data-parallel and model-parallel training.
- Solid software engineering skills in developing systems for production.
- Strong expertise in Python.
- Experience building end-to-end data systems as an ML Engineer, Platform Engineer, or equivalent.
- Experience working with cloud data processing technologies (S3, ECR, Lambda, AWS, Spark, Dask, Elasticsearch, Presto, SQL, etc.).
- Geospatial / remote sensing experience is a plus.
Competencies:
- Excellent debugging and critical thinking skills.
- Excellent analytical and problem-solving skills.
- Ability to work in a fast-paced, team-based environment.
Benefits:
- Medical Health Cover for you and your family, including unlimited online doctor consultations
- Access to mental health experts for you and your family
- Dedicated allowances for learning and skill development
- Comprehensive leave policy with casual, paid, marriage, and bereavement leave
- Twice-a-year appraisal
Interview Process:
- Intro call
- Assessment
- Interview rounds (ideally up to 3-4 rounds)
- Culture Round / HR round