Computer Scientist 2, Infra Engineering
Job Summary
Computer Scientist
Location: Noida
Experience: 6-9 Years
Team: AI Platform Engineering
Role Overview
We are looking for an experienced Infrastructure Developer (6-9 years) to help design, build, and scale the platform that powers our most demanding ML training workloads. This is a hands-on engineering role where you will write production-grade code, drive meaningful technical initiatives, and contribute to the reliability of an infrastructure that thousands of GPU hours depend on every day.
You bring strong Kubernetes skills, solid networking fundamentals, a developer's mindset, and the ability to own projects end-to-end with limited supervision. You have operated systems at significant scale and are ready to step up into broader technical leadership.
About the Platform
You will be working on a cutting-edge platform designed to train and serve large-scale machine learning models. The platform supports everything from small-scale experimentation to large distributed training jobs running on GPU clusters with thousands of accelerators. It provides ML engineers and researchers with the tools to onboard, monitor, and scale their workloads, whether a lightweight prototype or a production-grade deep learning model powering real-world applications.
Key platform capabilities:
Dynamic GPU orchestration: using Kubernetes with custom schedulers and resource topology awareness.
Training & inference workflows: end-to-end pipeline support from data ingestion through model serving.
Observability & cost tracking: full-stack visibility across compute, network, and storage layers.
Self-service developer tooling: enabling high-velocity experimentation without platform bottlenecks.
Multi-cloud infrastructure: primarily AWS, with Azure/GCP expansion underway.
Your contributions will directly influence the reliability, scalability, and efficiency of this platform, and the speed at which AI teams can innovate.
What You'll Do
Build for scale: Design and improve Kubernetes-native infrastructure that runs distributed GPU training jobs reliably and efficiently. You will own significant components and drive their evolution.
Lead focused initiatives: Own meaningful projects end-to-end. Write design docs, gather input from stakeholders, and deliver under realistic timelines, often collaborating with engineers across time zones.
Codify infrastructure: Define and ship cloud infrastructure through IaC (Terraform/Pulumi). Apply the same rigor, testing, and review discipline to infra changes as to application code.
Strengthen observability: Contribute to and extend deep observability stacks (metrics, distributed tracing, log aggregation, SLO/SLI frameworks) that surface problems before they become incidents.
Write production code: Build automation, internal tooling, operators, and platform services in Go, Python, or Rust. This is not a YAML-only role.
Own reliability: Participate in incident response, post-mortems, and reliability reviews. Drive systemic fixes, not just workarounds. Be a strong contributor to on-call culture.
Solve hard networking problems: Debug and resolve complex cluster networking issues (CNI, BGP, service mesh, DNS at scale, east-west traffic, throughput tuning).
Mentor and grow: Raise the technical bar through code reviews, design feedback, and knowledge sharing with peers and more junior engineers.
What You Bring
Core Requirements
Kubernetes & GPU Infrastructure
6-9 years in SRE, platform engineering, or infrastructure roles
Strong working knowledge of Kubernetes internals: scheduler, kubelet, CRDs, operators, admission controllers
Hands-on experience running GPU/accelerator training workloads in production
Familiarity with multi-cluster management and workload placement strategies
Helm, Kustomize, GitOps (Flux/ArgoCD): practical experience and good judgment on when to use them
Cloud & Infrastructure as Code
Solid hands-on AWS experience (VPC, EKS, EC2, S3, IAM; TGW a plus)
Production experience with Terraform or Pulumi, modular and tested
CI/CD for infrastructure: drift detection, plan gating, rollback strategies
Working understanding of cost optimization, reserved capacity, and spot instance management
Observability
Prometheus, Grafana, Alertmanager: production experience, not just lab setups
Exposure to distributed tracing: OpenTelemetry, Jaeger, or Tempo
Log aggregation: Loki, Elasticsearch/OpenSearch
Comfort with SLO/SLI design, error budgets, and multi-tier alerting
Networking Fundamentals
Strong TCP/IP, DNS, TLS, HTTP/2, and gRPC fundamentals
Practical experience with CNI plugins (Cilium, Calico, or Flannel) and their trade-offs
Familiarity with service mesh (Istio/Linkerd), ingress controllers, and API gateways
Ability to debug under load: packet captures, eBPF traces, kernel counters
Coding & System Design
Production-quality code in Go, Python, or Rust: you ship, not just script
Solid grasp of distributed systems fundamentals: consistency, availability, failure modes
Experience writing Kubernetes operators or working with controller-runtime patterns
Engaged code reviewer: thoughtful, constructive, and consistent
Clear technical writer: design docs, ADRs, and runbooks that others can actually use
Collaboration & Ownership
Has delivered meaningful cross-functional projects from design to production
Comfortable with ambiguity: can break down a problem and make progress without a perfect spec
Experience working async across distributed teams and time zones
Strong communicator: can explain infra trade-offs clearly to peers and partner teams
Self-driven: identifies problems, proposes solutions, and follows through to outcomes
Bonus Points
Azure / GCP hands-on experience
Familiarity with ML training pipeline internals
eBPF-based observability or networking
Chaos engineering or game day participation
Open-source infrastructure contributions
Security, compliance, or audit exposure
Why This Role
You will write software, not just YAML. This is a coding role as much as it is an operations role.
You will work on real AI infrastructure challenges, the kind that research papers get written about, not buzzword slide decks.
You will see your impact across developer productivity, platform scalability, and service reliability.
You will grow. This role gives you room to step into broader technical leadership over time.
You will join a team that values code quality, systems thinking, blameless culture, and genuine ownership.
You will work on systems at a scale most engineers never get to touch: thousands of GPUs, petabytes of data movement, and milliseconds of scheduling latency that matter.
If you have built and operated real infrastructure, care about doing it well, and are ready to take on broader scope, we want to talk.
About Adobe
Adobe empowers everyone to create through innovative platforms and tools that unleash creativity, productivity, and personalized customer experiences. Adobe's industry-leading offerings, including Adobe Acrobat Studio, Adobe Express, Adobe Firefly, Creative Cloud, Adobe Experience Platform, Adobe Experience Manager, and GenStudio, enable people and businesses to turn ideas into impact, powered by AI and driven by human ingenuity.
Our 30,000 employees worldwide are creating the future and raising the bar as we drive the next decade of growth. We're on a mission to hire the very best and believe in creating a company culture where all employees are empowered to make an impact. At Adobe, we believe that great ideas can come from anywhere in the organization. The next big idea could be yours.
Let's Adobe together
At Adobe, we believe in creating a company culture where all employees are empowered to make an impact. Learn more about Adobe life, including our values and culture, focus on people, purpose and community, Adobe for All, comprehensive benefits programs, the stories we tell, the customers we serve, and how you can help us advance our mission of empowering everyone to create.
Adobe is proud to be an Equal Employment Opportunity employer. We do not discriminate based on gender, race or color, ethnicity or national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, or any other protected characteristic.
Adobe aims to make our Careers website and recruiting process accessible to any and all users. If you have a disability or special need that requires accommodation to navigate our website or complete the application process email or call 1 .
AI Use Guidelines for Interviews:
Our interviews are designed to reflect your own skills and thinking. The use of AI or recording tools during live interviews is not permitted unless explicitly invited by the interviewer or approved in advance as part of a reasonable accommodation. If these tools are used inappropriately, or in a way that misrepresents your work, your application may not move forward in the process.
At Adobe, we empower employees to innovate with AI, and we look for candidates eager to do the same. As part of the hiring experience, we provide clear guidance on where AI is encouraged during the process and where it's restricted during live interviews. See how we think about AI in the hiring experience.
Required Experience:
IC