982 SRE Engineer Senior LATAM

Darwoft


Job Location:

Córdoba - Argentina

Monthly Salary: Not Disclosed
Posted on: 10 hours ago
Vacancies: 1 Vacancy

Job Summary

Disclaimer (must read): Commitment & Focus. This role requires full-time dedication, with clear priority given to Darwoft projects during the established working hours. It is not compatible with other full-time professional engagements. Any additional professional activities must be disclosed in advance and must not interfere with the responsibilities or working hours of this role.

About Darwoft

Darwoft is a software factory that develops custom software solutions and provides IT staff augmentation services for international clients, primarily in the United States and Latin America. We work with startups and high-growth companies to build high-impact digital products. Our culture is people-first, focused on technical quality, long-term relationships, and a collaborative mindset. We combine technical excellence with human proximity.

Senior Site Reliability Engineer (AI Platform & Observability), Contractor, Global

General Information

  • Location: Remote (Global)
  • Contract Type: Contractor
  • Industry / Project: AI Infrastructure & Platform Operations
  • Time Zone: Coordination with US / LATAM teams
  • English Level: Advanced (C1)

About the Role

We are seeking a Senior Site Reliability Engineer with a deep focus on observability and AI platform operations. This role sits at the intersection of reliability engineering and emerging AI infrastructure. You will own the instrumentation, visibility, and operational health of AI-powered systems, including LLM API gateways, token usage pipelines, and model serving infrastructure.

You will act as the authority on runtime behavior across our AI stack, building the tooling and insights required to understand, measure, and optimize system performance, reliability, and cost.

Responsibilities

  • Design and operate AI gateway infrastructure, including routing, rate limiting, and traffic shaping for LLM API traffic.
  • Build and maintain deep observability into AI workloads: token consumption, model latency, cost attribution, and error rates by model, team, and use case.
  • Define and track SLIs, SLOs, and error budgets for AI services and API-dependent workflows.
  • Instrument LLM-backed applications to surface prompt/completion telemetry, retry patterns, and quota burn rates.
  • Develop dashboards and alerting using Grafana, Loki, and Prometheus, tailored to AI traffic patterns (beyond traditional infrastructure metrics).
  • Maintain and evolve observability pipelines capable of handling high-cardinality AI metadata.
  • Lead incident response for AI platform degradations, including model unavailability, gateway saturation, and upstream provider outages.
  • Automate operational workflows across AI infrastructure using Infrastructure as Code (IaC) and CI/CD practices.
  • Collaborate closely with ML/AI engineering teams to embed reliability and cost-visibility practices early in the development lifecycle.

Requirements

Must-Have:

  • Strong experience with AWS cloud services, specifically those relevant to AI workloads (Bedrock, SageMaker, Lambda, API Gateway).
  • Hands-on expertise with Kubernetes in production environments.
  • Proven experience building and operating observability stacks (Prometheus, Grafana, Loki), with an emphasis on application- and API-layer metrics.
  • Solid understanding of API gateway patterns, including routing, throttling, authentication, and traffic observability.
  • Experience instrumenting and monitoring LLM or AI API usage (token budgets, cost tracking, latency profiling).
  • Proficiency in Python, Go, or Bash for automation and tooling.
  • Mastery of Infrastructure as Code (Terraform) and CI/CD pipelines.
  • Strong analytical mindset, with the ability to extract signal from high-cardinality telemetry.

Nice-to-Have:

  • Experience operating or integrating AI gateway solutions (e.g., Kong AI Gateway, Portkey, LiteLLM).
  • Familiarity with OpenTelemetry and distributed tracing for AI/ML workloads.
  • Experience with FinOps practices for AI, including chargeback models and cost anomaly detection.
  • Knowledge of service mesh technologies and their role in AI traffic management.

What We Offer (Contractor)

  • Contractor agreement with payment in USD.
  • 100% remote work in an international environment.
  • Access to Argentine public holidays.
  • Professional development in the cutting-edge field of AI Platform Engineering.
  • Referral program and access to learning platforms.
  • English classes to further enhance professional communication.

Explore this and other opportunities at:
