DevOps Engineer - AIOps

São Paulo - Brazil

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

We are seeking a hands-on Site Reliability Engineer (SRE) / AI Platform DevOps Engineer to own infrastructure provisioning, CI/CD automation, telemetry pipelines, and production deployment for AI-powered services, agents, and orchestration systems.

This is an SRE-heavy, infrastructure-first role focused on ensuring that AI systems operating in production are:

Reliable
Observable
Scalable
Secure
Cost-efficient
Safe to deploy and operate

You will play a critical role in building and maintaining the platform foundation that enables AI services to run safely and efficiently at scale.

Key Responsibilities

1. Infrastructure Provisioning & Automation

Design and manage cloud infrastructure using Infrastructure as Code (Terraform or similar)
Provision and maintain Kubernetes clusters and supporting services
Automate environment setup across development, staging, and production
Manage networking, IAM, secrets, storage, and compute scaling
Ensure high availability, resilience, and disaster recovery readiness

2. CI/CD & Deployment Engineering

Build and maintain CI/CD pipelines for:
- AI services
- Agent frameworks
- Orchestrators
- Model artifacts
Implement automated testing and reliability validation gates
Enable blue/green and canary deployments
Build safe rollback mechanisms for services and models
Integrate reliability and health checks into deployment workflows

3. Model & Agent Deployment Governance

Package, version, and deploy models into containerized environments
Manage model artifact storage and promotion across environments
Monitor model performance and detect degradation
Support retraining cycle integration and model refresh workflows
Ensure safe rollout and rollback of model versions
Implement monitoring for inference latency, throughput, and cost

4. Data Pipelines for Telemetry & Observability

Design and maintain data pipelines to ingest, clean, and process high-volume telemetry (logs, metrics, traces, events)
Enable structured telemetry for AI and orchestration workflows
Ensure reliability for real-time and batch processing
Optimize pipeline scalability and performance

5. AIOps Platform Integration

Evaluate, deploy, and integrate AIOps platforms
Improve anomaly detection, correlation, and alert intelligence
Reduce alert noise and improve signal quality
Integrate AIOps outputs into operational workflows and incident management

6. Intelligent Incident Automation

Automate incident detection and remediation workflows
Build self-healing scripts and intelligent runbooks
Reduce mean time to detect (MTTD) and mean time to resolve (MTTR) through automation
Integrate AI-driven root cause analysis insights into operational tooling
Improve prevention of recurring incidents

7. Production Reliability & SRE Excellence

Define and manage SLIs, SLOs, and error budgets
Implement monitoring dashboards and alerting systems
Participate in on-call rotation
Lead incident triage and root cause analysis
Improve resilience, scaling, and failure handling
Implement circuit breakers, rate limits, and failover mechanisms

8. Security & Governance

Implement least-privilege access controls
Manage secrets and credential rotation
Enforce environment isolation
Ensure auditability and compliance for AI systems

Qualifications

Required Experience

5 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles
Strong hands-on experience with cloud platforms (AWS, Azure, or GCP)
Proven expertise with Kubernetes and containerized workloads
Experience with Infrastructure as Code (Terraform, CloudFormation, etc.)
Strong CI/CD implementation experience (GitHub Actions, GitLab CI, Jenkins, etc.)
Experience building observability stacks (Prometheus, Grafana, OpenTelemetry, ELK, Datadog, etc.)
Experience defining and managing SLIs/SLOs and error budgets
Hands-on experience with incident response and production support
Strong scripting skills (Python, Bash, or similar)

AI/ML Platform Experience (Strongly Preferred)

Experience deploying and managing AI/ML services in production
Familiarity with model packaging, versioning, and artifact management
Understanding of model lifecycle management and retraining workflows
Experience monitoring inference performance, latency, and cost
Exposure to AIOps tools and intelligent alerting systems

Additional Skills

Strong understanding of distributed systems reliability patterns
Knowledge of security best practices in cloud-native environments
Experience implementing high-availability and disaster recovery strategies
Excellent problem-solving and root cause analysis skills
Strong communication skills and ability to collaborate across engineering and AI teams

Additional Information

Discover some of the global benefits that empower our people to become the best version of themselves:

Finance: Competitive salary package, share plan, company performance bonuses, value-based recognition awards, referral bonus;
Career Development: Career coaching, global career opportunities, non-linear career paths, internal development programmes for management and technical leadership;
Learning Opportunities: Complex projects, rotations, internal tech communities, training, certifications, coaching, online learning platforms subscriptions, pass-it-on sessions, workshops, conferences;
Work-Life Balance: Hybrid work and flexible working hours, employee assistance programme;
Health: Global internal wellbeing programme, access to wellbeing apps;
Community: Global internal tech communities, hobby clubs and interest groups, inclusion and diversity programmes, events and celebrations.

At Endava, we're committed to creating an open, inclusive, and respectful environment where everyone feels safe, valued, and empowered to be their best. We welcome applications from people of all backgrounds, experiences, and perspectives, because we know that inclusive teams help us deliver smarter, more innovative solutions for our customers. Hiring decisions are based on merit, skills, qualifications, and potential. If you need adjustments or support during the recruitment process, please let us know.

Employment Type: Full-time
