AWS Observability Architect
Location: Warren, NJ (Onsite)
Full-time ($160 - $170K PA). Don't stop profiles on salary.
We are seeking a highly skilled AWS Observability Architect with deep hands-on expertise in designing and implementing enterprise-grade observability platforms on AWS - with Grafana as the primary observability tool and OpenTelemetry as the instrumentation standard. This is a technical specialist role requiring genuine implementation experience, not platform familiarity.
The ideal candidate has personally architected and delivered large-scale observability solutions for production AWS environments - building telemetry pipelines, designing dashboards that operations teams actually use, and creating alerting frameworks that reduce MTTR rather than add noise. You understand the full observability stack: from application instrumentation with OpenTelemetry SDKs through to Grafana dashboards consumed by SREs, on-call engineers, and engineering leadership.
This role sits at the intersection of cloud infrastructure, software engineering discipline, and operational excellence - requiring someone who can design an enterprise observability architecture in the morning, write a Grafana dashboard query in the afternoon, and advise a development team on OpenTelemetry instrumentation strategy the next day.
Key Responsibilities
Observability Architecture & Strategy
Define and own the enterprise observability architecture for AWS environments - establishing the target-state design across the four pillars of observability: metrics, logs, traces, and events.
Design end-to-end telemetry pipelines - from instrumentation at the application and infrastructure layer through collection, processing, storage, and visualisation - with Grafana as the enterprise observability platform.
Develop observability standards and reference architectures - defining how AWS workloads across compute (EC2, EKS, ECS, Lambda), storage, networking, and managed services should be instrumented, and how their telemetry should be collected and visualised consistently across the organisation.
Establish signal-to-noise discipline across the observability platform - designing alerting frameworks that surface actionable signals, eliminate false positives, and ensure on-call engineers are alerted only when human intervention is genuinely required.
Define observability maturity roadmaps for client environments - assessing current-state coverage, identifying gaps, and building a phased improvement plan from reactive monitoring to proactive, AIOps-ready observability.
Drive FinOps for observability - optimising telemetry data volumes, retention policies, and Grafana Enterprise licensing costs to ensure the observability platform itself does not become a significant cost centre.
Grafana Enterprise Implementation
Architect, deploy, and operate Grafana Enterprise or Grafana SaaS as the primary observability platform - including high-availability Grafana deployment on AWS (EKS-based or managed via Grafana Cloud), data source federation, RBAC configuration, and enterprise plugin management.
Design and implement Grafana data source integrations across the AWS observability ecosystem:
Amazon CloudWatch - metrics, logs, and alarms as a core AWS data source
Grafana Mimir - for scalable, long-term, Prometheus-compatible metrics storage
Grafana Loki - for cost-efficient, label-based log aggregation at scale
Grafana Tempo - for distributed tracing storage and trace-to-log-to-metric correlation
Amazon Managed Service for Prometheus (AMP) - for AWS-native Prometheus metrics
Amazon OpenSearch - for log analytics and full-text search use cases
Elasticsearch / OpenSearch - for existing log infrastructure integration
Build and maintain a Grafana dashboard library - covering infrastructure health, application performance, SLO/SLA tracking, capacity planning, cost visibility, incident response, and executive reporting - using reusable, variable-driven, and consistently styled templates.
Implement Grafana alerting at enterprise scale - including alert routing, notification policies, silence management, and integration with PagerDuty, OpsGenie, ServiceNow, and Slack for multi-channel incident notification.
Configure Grafana RBAC and team structures - designing role hierarchies, folder permissions, and data source access controls that enable self-service dashboarding for development teams while protecting sensitive operational data.
Deploy and manage Grafana OnCall for on-call scheduling and alert routing, or integrate Grafana alerting with existing incident management platforms.
Implement Grafana SLO (Service Level Objectives) - defining, tracking, and reporting error budgets across production services, enabling data-driven reliability decisions.
Manage Grafana as code - using Grafana's provisioning capabilities (YAML/JSON), the Terraform provider, and Grizzly/Grafonnet for dashboard version control, environment promotion, and GitOps-based dashboard management.
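To make the error-budget responsibility above concrete, here is a minimal sketch of the arithmetic behind an availability SLO. It is illustrative only: the function name and parameters are hypothetical, and it assumes a simple request-based SLO rather than any particular Grafana SLO configuration.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative once overspent).

    slo_target is the success-ratio objective, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Example: a 99.9% target over 1,000,000 requests tolerates about 1,000 failures,
# so 250 failures leaves roughly three quarters of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

A negative result means the budget for the window is exhausted, which is typically the point at which reliability work takes priority over feature releases.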
OpenTelemetry Implementation
Define and lead the organisation's OpenTelemetry (OTel) instrumentation strategy - establishing standards for automatic and manual instrumentation across application stacks running on AWS.
Design and deploy the OpenTelemetry Collector as the central telemetry processing layer - including:
Collector deployment patterns: agent (DaemonSet on EKS), gateway (centralised), and sidecar configurations
Exporter configuration - routing telemetry to Grafana Mimir (metrics), Grafana Loki (logs), Grafana Tempo (traces), AMP, and CloudWatch
Instrument AWS workloads with OpenTelemetry SDKs across languages (Java, Python, Go) - including auto-instrumentation for containerised EKS workloads, Lambda instrumentation using OTel Lambda layers, and ECS task definition instrumentation.
Implement distributed tracing using OpenTelemetry - establishing trace propagation standards across microservices, configuring context propagation (W3C TraceContext, B3), and ensuring end-to-end trace visibility from frontend to backend to database.
Design OTel-based log correlation - enriching logs with trace IDs and span IDs to enable trace-to-log navigation in Grafana, supporting faster RCA during incidents.
Implement OTel-based metric instrumentation - defining custom business and application metrics alongside system metrics, following OTel semantic conventions for consistent metric naming and attribute tagging across services.
Define sampling strategies for distributed traces - including head-based sampling for development environments and tail-based sampling (via OTel Collector) for production environments, balancing observability coverage with storage cost.
Manage OTel Collector as infrastructure - including horizontal scaling, resource limits, high-availability deployment, collector health monitoring, and pipeline performance optimisation.
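The log-correlation and context-propagation items above can be illustrated with a small sketch. It uses only the Python standard library rather than the OpenTelemetry SDK, and the helper names (`parse_traceparent`, `enrich_log`) are hypothetical. It assumes the version-00 layout of the W3C `traceparent` header (`version-traceid-parentid-flags`) and a plain key=value log format.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse a W3C traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def enrich_log(message: str, ctx: Optional[SpanContext]) -> str:
    """Append trace and span IDs so a log line can be joined to its trace."""
    if ctx is None:
        return message
    return f"{message} trace_id={ctx.trace_id} span_id={ctx.span_id}"


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
line = enrich_log("payment failed", ctx)
```

In practice the OTel SDKs and logging integrations inject these fields automatically; the sketch only shows what ends up in the log line so that a trace ID can link a log entry to its trace.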
AWS Observability Services Integration
Design the integration architecture between AWS-native observability services and Grafana - positioning Grafana as the unified observability plane while leveraging AWS-native services as data sources:
Amazon Managed Grafana (AMG) - evaluating and advising on AMG vs self-managed Grafana deployment decisions
Amazon Managed Service for Prometheus (AMP) - remote write from OTel Collector and Prometheus agents, recording rules, and Alertmanager integration
AWS X-Ray - ingesting X-Ray traces into Grafana Tempo or directly via the Grafana X-Ray data source
AWS CloudTrail - audit log integration for security and compliance observability
VPC Flow Logs - network observability integration for security monitoring and traffic analysis
Implement infrastructure-level observability for core AWS services - EC2 (CloudWatch agent, Node Exporter via OTel), EKS (kube-state-metrics, cAdvisor, OTel DaemonSet), RDS (Enhanced Monitoring, Performance Insights), Lambda (OTel Lambda layer, custom metrics), and API Gateway (access logs, CloudWatch metrics).
Design business and synthetic monitoring - implementing Grafana Synthetic Monitoring or CloudWatch Synthetics for endpoint availability, API health, and user journey monitoring, with Grafana alerting integration.
Delivery & Enablement
Lead observability implementation projects end-to-end - from requirements gathering and architecture design through deployment, dashboard development, alert tuning, and team enablement.
Conduct observability maturity assessments for client environments - evaluating current monitoring coverage, tool sprawl, alert quality, and SLO definition maturity, and producing prioritised remediation roadmaps.
Develop and deliver observability enablement workshops for engineering and operations teams - covering OTel instrumentation, Grafana dashboard development, alert design, and on-call best practices.
Produce observability architecture documentation - reference architectures, runbooks, onboarding guides, and dashboard documentation that enable teams to self-serve and maintain the platform.
Advise on observability tool consolidation - helping organisations rationalise fragmented monitoring estates (Datadog, New Relic, Splunk, Nagios, Zabbix) toward a unified Grafana and OTel platform, including migration planning and cost impact analysis.
Experience
10 years of overall experience in cloud infrastructure, platform engineering, or DevOps.
5 years of hands-on AWS experience in production environments - not advisory or oversight roles.
3 years of hands-on Grafana Enterprise or SaaS implementation experience - designing, deploying, and operating Grafana at enterprise scale, including Mimir, Loki, and Tempo (the LGTM stack).
Proven experience implementing OpenTelemetry in production environments - including OTel Collector deployment, SDK-based instrumentation, and distributed tracing implementation.
Demonstrated experience building production-grade observability pipelines - from instrumentation through collection, processing, storage, and visualisation.
Hands-on experience with PromQL for metrics querying and alerting - including complex queries, recording rules, and alert expression design.
Experience with LogQL (Grafana Loki) for log querying and log-based alerting.
Hands-on experience deploying observability infrastructure on Kubernetes (EKS) - including Prometheus Operator, OTel DaemonSets, Grafana deployment, and persistent storage configuration.
Experience with Grafana as code - provisioning dashboards, data sources, and alert rules via YAML, Terraform, or Grafonnet.