Platform Reliability Engineer (Agentic AI)

Search Atlas

Job Location:

São Paulo - Brazil

Monthly Salary: m 5500 - 6500

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

The Mission: Building the Autonomous Nervous System

Search Atlas is moving beyond suggestions to full execution.

Our agent, Atlas Brain, handles SEO, AEO, Google Ads, and AI Content Generation autonomously, with zero manual intervention.

While Platform Engineers build self-service tools for developers, you ensure those tools enable autonomous AI execution with 99.99% reliability. You're not keeping dashboards alive; you're building the engine that allows an AI Agent to replace manual marketing execution. If the platform is reliable, the agent is unstoppable.

What You Will Do:

Architect the Autonomous Backbone

Design and maintain the Kubernetes-based platform (EKS/GKE) that hosts Atlas Brain and its distributed agentic workers, handling millions of requests across SEO crawling, content generation, and ad optimization pipelines.

Engineer for Zero-Touch

Automate every aspect of infrastructure using Terraform, ArgoCD, and Go/Python. If you have to do it twice, it must be a script. Enable true zero manual execution at the infrastructure level.

Scale Agentic Workflows

Optimize ML inference pipelines for real-time agent decision-making
Architect high-concurrency crawling systems that feed Atlas Brain's intelligence
Ensure sub-second latency for agent task execution (SEO, Content, AI Builder)
Handle high-frequency data pipelines: real-time bidding, SERP monitoring, content generation at scale

Define Radical Reliability for AI

Establish SLOs/SLIs specifically for AI execution success rates and agent task completion, not just uptime. Design self-healing systems that preemptively resolve failures before they impact autonomous workflows.
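As a rough illustration of what a task-level SLI and its error budget could look like, here is a minimal sketch. The `TaskWindow` shape, the function names, and the 99.9% target are assumptions chosen for the example (the target mirrors the sub-0.1% failure rate named in the success criteria), not Search Atlas's actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskWindow:
    """Counts of agent tasks observed in one SLO window (hypothetical shape)."""
    attempted: int
    completed: int  # tasks that finished with a valid result


def completion_sli(window: TaskWindow) -> float:
    """Fraction of attempted tasks that completed; 1.0 when nothing ran."""
    if window.attempted == 0:
        return 1.0
    return window.completed / window.attempted


def error_budget_remaining(window: TaskWindow, slo_target: float) -> float:
    """Share of the allowed failures still unspent (negative means overspent)."""
    allowed_failures = (1.0 - slo_target) * window.attempted
    if allowed_failures == 0:
        return 0.0
    actual_failures = window.attempted - window.completed
    return 1.0 - actual_failures / allowed_failures


# Example: 1,000,000 tasks, 500 failures, against an assumed 99.9% target.
window = TaskWindow(attempted=1_000_000, completed=999_500)
sli = completion_sli(window)
budget_left = error_budget_remaining(window, slo_target=0.999)
```

Measuring completed tasks rather than uptime means a platform that is technically "up" but failing agent work still burns error budget.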

Observability for Agent Decisions

Build distributed tracing and monitoring for complex agentic interactions: trace agent decision trees across SEO/AEO/Ads workflows, enabling rapid diagnosis of why the agent made a particular choice. Implement OpenTelemetry, Prometheus, and Grafana for full visibility into autonomous execution.

Safety & Guardrails

Implement guardrails and safety controls for autonomous agent execution in marketing contexts, ensuring AI actions align with business rules, budget constraints, and compliance requirements. Design human-in-the-loop escalation paths for edge cases.
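To make the idea concrete, a budget guardrail with a human-in-the-loop escalation path might be sketched as follows. Everything here is hypothetical and for illustration only: the `ProposedAction` shape, the `check_action` function, and the dollar thresholds are assumptions, not part of any Search Atlas system.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"  # hand off to a human reviewer
    BLOCK = "block"


@dataclass(frozen=True)
class ProposedAction:
    """An action the agent wants to take (hypothetical shape)."""
    kind: str         # e.g. "raise_ad_bid"
    spend_usd: float  # incremental spend the action would cause


def check_action(action: ProposedAction, remaining_budget_usd: float,
                 review_threshold_usd: float) -> Decision:
    """Block overspend, escalate large-but-affordable actions, allow the rest."""
    if action.spend_usd > remaining_budget_usd:
        return Decision.BLOCK
    if action.spend_usd >= review_threshold_usd:
        return Decision.ESCALATE
    return Decision.ALLOW


# Example: $1,000 of budget left; anything at or above $200 needs review.
small = check_action(ProposedAction("raise_ad_bid", 50.0), 1000.0, 200.0)
large = check_action(ProposedAction("raise_ad_bid", 250.0), 1000.0, 200.0)
over = check_action(ProposedAction("raise_ad_bid", 1500.0), 1000.0, 200.0)
```

Checking the hard budget limit before the review threshold ensures an unaffordable action is never queued for approval by mistake.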

Cost & Performance Governance

Proactively optimize cloud spend and resource allocation (Karpenter/KEDA) as we scale to thousands of agencies. Balance performance with cost efficiency for unpredictable AI workloads.

Technical Requirements

Experience: 6 years in Platform Engineering, SRE, or Infrastructure roles within high-growth SaaS environments, with proven experience supporting AI/ML systems at scale.

Infrastructure as Code: Mastery of Terraform, ArgoCD, and GitOps workflows.

Container Orchestration: Expert-level Kubernetes (EKS/GKE): networking, scaling, security, and multi-tenancy patterns.

MLOps for Agents (Must-Have):

Hands-on experience with MLOps pipelines for autonomous agents
Model versioning and deployment strategies for continuous agent improvement
Prompt management and A/B testing of agent behaviors
Guardrails for safe tool execution and decision boundaries
Scaling AI inference services (LLMs, embeddings, classification models)

Languages: Proficiency in Python for building custom platform tools and automation.

Observability: Deep expertise in distributed tracing and monitoring for complex event-driven systems, specifically for debugging AI agent decision chains.

Data-Intensive Systems: Experience with high-frequency data pipelines, web crawling at scale, real-time processing, and low-latency requirements.

Why This Is Different

Unlike traditional SRE roles focused on keeping services up, you're building the infrastructure that enables autonomous AI to execute business-critical marketing tasks. Every millisecond of latency you eliminate and every self-healing mechanism you deploy directly impacts whether Atlas Brain can truly replace manual agency work.

This is not traditional SRE: you're building the autonomous nervous system for AI execution.

What Success Looks Like

Atlas Brain executes millions of marketing tasks daily with a failure rate below 0.1%
Zero infrastructure-related incidents requiring manual intervention during business hours
Platform scales from hundreds to thousands of agency clients without reliability degradation
Complete observability into agent behavior: we know not just that the agent acted, but why

Ready to build the platform that makes autonomous marketing execution a reality?
