Senior Cloud DevOps Engineer (SDE 3)

SatSure


Job Location: Bengaluru, India
Monthly Salary: Not Disclosed
Posted on: 30+ days ago
Vacancies: 1
Department: Engineering

Job Summary

About SatSure
SatSure is a deep-tech decision intelligence company operating at the nexus of agriculture, infrastructure, and climate action. We turn earth observation data into actionable insights for governments, financial institutions, and enterprises across the developing world, at scale and with reliability.
Our platform team owns the infrastructure backbone that powers SatSure's AI/ML products: multi-cloud Kubernetes clusters, LLM inference pipelines, geospatial data platforms, and the internal developer tooling used by every engineering team. If you care about infrastructure quality and want your work to have real-world impact, this is the role.
About the Role
We are looking for a Senior DevOps & MLOps Engineer to join our Platform & DevOps team. You will design, build, and operate cloud-native infrastructure that supports ML model serving, data pipelines, and developer platforms across AWS, GCP, and Azure. You will work closely with data science, product engineering, and security teams, and will be expected to own large surface areas end-to-end.
This is a hands-on senior IC role. You will architect systems, write Terraform and Helm, debug production incidents, define SLOs, and contribute to platform standards adopted org-wide.
Roles & Responsibilities
ML Platform & LLM Infrastructure
  • Own and operate the Kubernetes-based ML platform on EKS, supporting LLM inference (KServe), distributed compute (Dask/Ray), and workflow orchestration (Apache Airflow).
  • Partner with data science and ML teams to design, deploy, and scale ML workloads, including GPU scheduling, autoscaling, resource isolation, and SLO-driven reliability.
  • Architect, deploy, and optimize Ray clusters on Kubernetes for distributed ML workloads, enabling scalable training, batch inference, and low-latency serving with efficient CPU/GPU utilization.
Multi-Cloud Platform & Infrastructure
  • Design, build, and maintain cloud-native infrastructure across AWS (primary), GCP, and Azure using Kubernetes (EKS / GKE / AKS), Terraform, Helm, and ArgoCD.
  • Drive GitOps adoption and platform standardization: define reusable infrastructure patterns, Helm charts, and deployment workflows used across all product teams.
  • Manage Kubernetes platform operations: cluster lifecycle, Karpenter-based autoscaling, multi-tenancy, and workload isolation for data science and engineering teams.
  • Implement and maintain the service mesh (Istio): mTLS enforcement, traffic policies, and observability for inter-service communication.
  • Maintain and improve the internal developer platform (Backstage IDP), enabling self-service environments, a service catalog, and onboarding workflows for engineering teams.
Observability & Reliability Engineering
  • Build and maintain full-stack observability infrastructure: metrics (Prometheus / Mimir), logs (Loki), traces (Tempo), and dashboards (Grafana), integrated with OpenTelemetry instrumentation.
  • Define SLIs, SLOs, and error budget policies for production ML and platform services; lead incident response and post-mortem reviews.
  • Proactively identify reliability risks and drive engineering improvements to maintain 99.9% uptime targets.
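To make the 99.9% target concrete, here is a minimal sketch of the error-budget arithmetic behind an availability SLO. The 30-day window, the function names, and the sample downtime are illustrative assumptions, not SatSure's actual policy. Under those assumptions, a 99.9% target leaves roughly 43 minutes of allowable downtime per window.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed downtime within `window` for an availability SLO (e.g. 0.999)."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, window)
    if budget == timedelta(0):
        # A 100% target leaves no budget to spend.
        return 0.0
    return 1.0 - downtime / budget


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed rolling window
    print("Budget:", error_budget(0.999, window))
    # Hypothetical: 10 minutes of downtime recorded so far in the window.
    print("Remaining fraction:", round(budget_remaining(0.999, window, timedelta(minutes=10)), 3))
```

A policy built on this would typically tie release decisions to the remaining fraction, for example pausing risky rollouts once the budget is nearly spent.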
FinOps & Cost Engineering
  • Implement Kubernetes cost attribution and chargeback using Kubecost / OpenCost, driving per-team visibility and FinOps decision-making for AI infrastructure.
  • Continuously optimize cloud spend through workload right-sizing, spot/preemptible usage, and resource scheduling strategies.
Platform Security & Governance
  • Manage AWS multi-account governance using Control Tower, SCPs, GuardDuty, and IAM Identity Center, ensuring security posture across all environments.
  • Own OIDC identity and SSO infrastructure integrated across internal tooling: Backstage, Airflow, and platform services.
  • Support compliance and audit processes: ISO 27001, CIS Benchmarks, Well-Architected Reviews, and VAPT assessments.
Requirements
Must Have
  • 5 years of hands-on platform, DevOps, or SRE experience in production environments.
  • Strong Kubernetes expertise: cluster operations, Helm, RBAC, autoscaling (Karpenter / Cluster Autoscaler), and multi-tenancy; EKS experience preferred.
  • Infrastructure as Code: Terraform (advanced) and Ansible; experience managing large, multi-environment IaC codebases.
  • AWS expertise: EC2, EKS, S3, RDS, IAM, VPC, CloudWatch, Control Tower, and GuardDuty; GCP or Azure exposure is a plus.
  • GitOps & CI/CD: ArgoCD, Bitbucket Pipelines / Jenkins, and GitOps workflows at team scale.
  • Observability: hands-on experience with Prometheus and Grafana, plus at least one of Loki, Tempo, OpenTelemetry, Datadog, or ELK.
  • Scripting & automation: Python and Bash for tooling, automation, and platform integrations.
  • Strong understanding of networking, security, and cloud cost management in Kubernetes environments.
Nice to Have
  • Experience with ML serving infrastructure: KServe, vLLM, Ray Serve, or similar model serving frameworks.
  • Experience with Apache Airflow, Dask, or other data/ML pipeline orchestration at scale.
  • Familiarity with Backstage or similar internal developer platforms (IDP).
  • Istio or Envoy service mesh experience.
  • FinOps tooling: Kubecost, OpenCost, or cloud provider cost management tools.
  • OIDC / identity provider experience (Zitadel, Keycloak, or similar).
  • AWS Certified Solutions Architect or equivalent cloud certification.
  • Exposure to geospatial data workloads or satellite imagery pipelines.
Minimum Qualification
  • Bachelor's degree in Computer Science, Information Technology, or a related engineering discipline.
Our Stack
Kubernetes (EKS / GKE / AKS), AWS, GCP, Azure, Terraform, Helm, ArgoCD, Istio, KServe, Apache Airflow, Dask, Backstage IDP, Prometheus, Grafana, Loki, Tempo, OpenTelemetry, Kubecost, Python, Bash
Why SatSure
  • Real Production Scale: LLM inference, geospatial data pipelines, and multi-cloud Kubernetes, not toy projects.
  • High Ownership: You architect systems end-to-end. No tickets-only culture, no hand-holding required.
  • Meaningful Impact: Your infrastructure powers products used by governments and institutions across the developing world.
  • Growth & Benefits: Learning allowances, broadband, medical insurance, best-in-class leave policy, and hybrid work from Bengaluru.

Required Experience:

Senior IC

About Company

SatSure is a deep tech, decision intelligence company. We leverage advances in satellite remote sensing, machine learning, big data analytics and cloud computing to create products and solutions which help enterprises and their people make smart decisions.
