AWS Observability Architect

Cloudious LLC


Job Location:

Warren, OH - USA

Monthly Salary: Not Disclosed
Posted on: 3 hours ago
Vacancies: 1 Vacancy

Job Summary

Location: Warren, NJ (Onsite)

Full-time ($160K - $170K PA). Don't stop profiles on salary.

We are seeking a highly skilled AWS Observability Architect with deep hands-on expertise in designing and implementing enterprise-grade observability platforms on AWS - with Grafana as the primary observability tool and OpenTelemetry as the instrumentation standard. This is a technical specialist role requiring genuine implementation experience, not platform familiarity.

The ideal candidate has personally architected and delivered large-scale observability solutions for production AWS environments - building telemetry pipelines, designing dashboards that operations teams actually use, and creating alerting frameworks that reduce MTTR rather than add noise. You understand the full observability stack: from application instrumentation with OpenTelemetry SDKs through to Grafana dashboards consumed by SREs, on-call engineers, and engineering leadership.

This role sits at the intersection of cloud infrastructure, software engineering discipline, and operational excellence - requiring someone who can design an enterprise observability architecture in the morning, write a Grafana dashboard query in the afternoon, and advise a development team on OpenTelemetry instrumentation strategy the next day.

Key Responsibilities

Observability Architecture & Strategy

  • Define and own the enterprise observability architecture for AWS environments - establishing the target-state design across the four pillars of observability: metrics, logs, traces, and events.
  • Design end-to-end telemetry pipelines - from instrumentation at the application and infrastructure layer through collection, processing, storage, and visualisation - with Grafana as the enterprise observability platform.
  • Develop observability standards and reference architectures - defining how AWS workloads across compute (EC2, EKS, ECS, Lambda), storage, networking, and managed services should be instrumented, collected, and visualised consistently across the organisation.
  • Establish signal-to-noise discipline across the observability platform - designing alerting frameworks that surface actionable signals, eliminate false positives, and ensure on-call engineers are alerted only when human intervention is genuinely required.
  • Define observability maturity roadmaps for client environments - assessing current-state coverage, identifying gaps, and building a phased improvement plan from reactive monitoring to proactive, AIOps-ready observability.
  • Drive FinOps for observability - optimising telemetry data volumes, retention policies, and Grafana Enterprise licensing costs to ensure the observability platform itself does not become a significant cost centre.

Grafana Enterprise Implementation

  • Architect, deploy, and operate Grafana Enterprise or Grafana SaaS as the primary observability platform - including high-availability Grafana deployment on AWS (EKS-based or managed via Grafana Cloud), data source federation, RBAC configuration, and enterprise plugin management.
  • Design and implement Grafana data source integrations across the AWS observability ecosystem:
    • Amazon CloudWatch - metrics, logs, and alarms as a core AWS data source
    • Grafana Mimir - for scalable, long-term, Prometheus-compatible metrics storage
    • Grafana Loki - for cost-efficient, label-based log aggregation at scale
    • Grafana Tempo - for distributed tracing storage and trace-to-log-to-metric correlation
    • Amazon Managed Service for Prometheus (AMP) - for AWS-native Prometheus metrics
    • Amazon OpenSearch - for log analytics and full-text search use cases
    • Elasticsearch / OpenSearch - for existing log infrastructure integration
  • Build and maintain a Grafana dashboard library - covering infrastructure health, application performance, SLO/SLA tracking, capacity planning, cost visibility, incident response, and executive reporting - using reusable, variable-driven, and consistently styled templates.
  • Implement Grafana alerting at enterprise scale - including alert routing, notification policies, silence management, and integration with PagerDuty, OpsGenie, ServiceNow, and Slack for multi-channel incident notification.
  • Configure Grafana RBAC and team structures - designing role hierarchies, folder permissions, and data source access controls that enable self-service dashboarding for development teams while protecting sensitive operational data.
  • Deploy and manage Grafana OnCall for on-call scheduling and alert routing, or integrate Grafana alerting with existing incident management platforms.
  • Implement Grafana SLO (Service Level Objectives) - defining, tracking, and reporting error budgets across production services, enabling data-driven reliability decisions.
  • Manage Grafana as code - using Grafana's provisioning capabilities (YAML/JSON), the Terraform provider, and Grizzly/Grafonnet for dashboard version control, environment promotion, and GitOps-based dashboard management.

OpenTelemetry Implementation

  • Define and lead the organisation's OpenTelemetry (OTel) instrumentation strategy - establishing standards for automatic and manual instrumentation across application stacks running on AWS.
  • Design and deploy the OpenTelemetry Collector as the central telemetry processing layer - including:
    • Collector deployment patterns: agent (DaemonSet on EKS), gateway (centralised), and sidecar configurations
    • Receiver configuration - OTLP, Prometheus, Jaeger, Zipkin, AWS X-Ray, CloudWatch, Fluent Bit
    • Processor pipeline design - batch processing, memory limiting, attribute enrichment, tail-based sampling, and resource detection processors
    • Exporter configuration - routing telemetry to Grafana Mimir (metrics), Grafana Loki (logs), Grafana Tempo (traces), AMP, and CloudWatch
  • Instrument AWS workloads with OpenTelemetry SDKs across languages (Java, Python, Go) - including auto-instrumentation for containerised EKS workloads, Lambda instrumentation using OTel Lambda layers, and ECS task definition instrumentation.
  • Implement distributed tracing using OpenTelemetry - establishing trace propagation standards across microservices, configuring context propagation (W3C Trace Context, B3), and ensuring end-to-end trace visibility from frontend to backend to database.
  • Design OTel-based log correlation - enriching logs with trace IDs and span IDs to enable trace-to-log navigation in Grafana, supporting faster RCA during incidents.
  • Implement OTel-based metric instrumentation - defining custom business and application metrics alongside system metrics, following OTel semantic conventions for consistent metric naming and attribute tagging across services.
  • Define sampling strategies for distributed traces - including head-based sampling for development environments and tail-based sampling (via OTel Collector) for production environments, balancing observability coverage with storage cost.
  • Manage OTel Collector as infrastructure - including horizontal scaling, resource limits, high-availability deployment, collector health monitoring, and pipeline performance optimisation.

AWS Observability Services Integration

  • Design the integration architecture between AWS-native observability services and Grafana - positioning Grafana as the unified observability plane while leveraging AWS-native services as data sources:
    • Amazon CloudWatch - metrics, logs, alarms, dashboards, Contributor Insights, and Synthetics
    • Amazon Managed Grafana (AMG) - evaluating and advising on AMG vs self-managed Grafana deployment decisions
    • Amazon Managed Service for Prometheus (AMP) - remote write from OTel Collector and Prometheus agents, recording rules, and alert manager integration
    • AWS X-Ray - ingesting X-Ray traces into Grafana Tempo or directly via the Grafana X-Ray data source
    • AWS CloudTrail - audit log integration for security and compliance observability
    • VPC Flow Logs - network observability integration for security monitoring and traffic analysis
  • Implement infrastructure-level observability for core AWS services - EC2 (CloudWatch agent, Node Exporter via OTel), EKS (kube-state-metrics, cAdvisor, OTel DaemonSet), RDS (Enhanced Monitoring, Performance Insights), Lambda (OTel Lambda layer, custom metrics), and API Gateway (access logs, CloudWatch metrics).
  • Design business and synthetic monitoring - implementing Grafana Synthetic Monitoring or CloudWatch Synthetics for endpoint availability, API health, and user journey monitoring, with Grafana alerting integration.

Delivery & Enablement

  • Lead observability implementation projects end-to-end - from requirements gathering and architecture design through deployment, dashboard development, alert tuning, and team enablement.
  • Conduct observability maturity assessments for client environments - evaluating current monitoring coverage, tool sprawl, alert quality, and SLO definition maturity, and producing prioritised remediation roadmaps.
  • Develop and deliver observability enablement workshops for engineering and operations teams - covering OTel instrumentation, Grafana dashboard development, alert design, and on-call best practices.
  • Produce observability architecture documentation - reference architectures, runbooks, onboarding guides, and dashboard documentation that enable teams to self-serve and maintain the platform.
  • Advise on observability tool consolidation - helping organisations rationalise fragmented monitoring estates (Datadog, New Relic, Splunk, Nagios, Zabbix) toward a unified Grafana and OTel platform, including migration planning and cost impact analysis.

Experience

  • 10 years of overall experience in cloud infrastructure, platform engineering, or DevOps.
  • 5 years of hands-on AWS experience in production environments - not advisory or oversight roles.
  • 3 years of hands-on Grafana Enterprise or SaaS implementation experience - designing, deploying, and operating Grafana at enterprise scale, including Mimir, Loki, Tempo, and the LGTM stack.
  • Proven experience implementing OpenTelemetry in production environments - including OTel Collector deployment, SDK-based instrumentation, and distributed tracing implementation.
  • Demonstrated experience building production-grade observability pipelines - from instrumentation through collection, processing, storage, and visualisation.
  • Hands-on experience with PromQL for metrics querying and alerting - including complex queries, recording rules, and alert expression design.
  • Experience with LogQL (Grafana Loki) for log querying and log-based alerting.
  • Hands-on experience deploying observability infrastructure on Kubernetes (EKS) - including Prometheus Operator, OTel DaemonSets, Grafana deployment, and persistent storage configuration.
  • Experience with Grafana as code - provisioning dashboards, data sources, and alert rules via YAML, Terraform, or Grafonnet.
