Grafana Observability SME
Poughkeepsie, NY - USA
Job Summary
Top Skills:
1. Production expertise across the full Grafana stack: Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, and Unified Alerting.
2. Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch.
3. OpenTelemetry practitioner - OTLP collectors, SDK/agent instrumentation for at least three of Go, Python .
4. eBPF-based auto-instrumentation experience with Beyla (or equivalent - Pixie, Cilium Tetragon) in a production context.
5. Experience integrating Grafana alerts into ServiceNow Event Management (native inbound integration, not webhook-only patterns); familiarity with ServiceNow ITOM AIOps event correlation and CMDB CI attachment.
6. Multi-environment hosting fluency - on-prem, AWS, Azure - and Linux/Windows host agent deployment at scale.
7. Dashboard-as-code and GitOps patterns (Grafana provisioning, Terraform provider, or Grizzly).
8. Excellent written communication - solution architecture documents, runbooks, and stakeholder-facing status reporting.
Role Summary
Own the end-to-end technical design, build, and operationalization of the Grafana Cloud observability platform for a 50-application estate spanning Go, Python and workloads hosted across on-premises data centres, AWS, and Azure. The SME serves as the senior technical authority across all eight in-scope Grafana Cloud modules and is accountable for instrumentation strategy, alerting design, dashboarding standards, and integration into ServiceNow ITOM via native Event Management. Scope is application-level observability only - server and network health remain on SolarWinds, and URL/synthetic monitoring remains on Uptrends.
Key Responsibilities
Platform architecture and configuration across all eight in-scope Grafana Cloud modules: Grafana 12 (visualization), Mimir (metrics, 13-month retention), Loki (logs), Tempo (distributed tracing via OTLP), Alloy (telemetry collection agent), Beyla (eBPF zero-code auto-instrumentation), Application Observability (OTel-native APM), and Unified Alerting.
Tenancy and access design - organizations, folders, teams, role-based access control, dashboard variables, template links, and annotations.
Application instrumentation strategy by technology stack: Beyla eBPF as the default zero-code path for Simple and Medium apps; OpenTelemetry SDKs/agents (Go, Python) for Complex apps requiring deeper traces and custom metrics; JMX Exporter, prometheus_client, and runtime-specific exporters where stack-appropriate.
Log pipeline engineering via Alloy - structured JSON, Log4j/Logback, Serilog, NLog, Windows Event Log, Winston, Pino, loguru - with parsing rules tuned per stack and LogQL-based dashboards and alerts.
Alerting design - PromQL/LogQL/TraceQL rules, severity taxonomy, grouping, routing, and notification policies. Build a low-noise, actionable alert feed; tune thresholds iteratively with application owners.
Single Pane of Glass - design and deliver a tiered SPoG that surfaces Grafana application telemetry alongside contextual links to SolarWinds and Uptrends.
Business Dashboards and Reporting - partner with the Dashboard Lead to define KPI taxonomy and ensure dashboard-as-code patterns and version control.
ServiceNow ITOM integration - co-own the design and review of the Grafana-to-ServiceNow Event Management (native inbound integration) flow: event allow-list governance (deny by default), enrichment, deduplication, AIOps correlation, automated incident creation with severity mapping and assignment group rules, CMDB CI attachment, and ServiceNow-as-master incident state.
Quality assurance authority across all technical deliverables - solution architecture document, instrumentation runbooks, dashboard and alert library, and integration test results.
Phased delivery execution - Mobilise & Discover, Application Foundation (ML1); Onboarding of 40 Simple apps (ML2); Medium/Complex apps, ITOM Integration (ML2-3); SPoG, Dashboards & Reporting (ML3-4); Stabilisation, KT, and post-deployment support (ML4).
Knowledge transfer - produce platform operating procedures and conduct structured handover to the client's run team.
Required Skills & Experience
7 years in observability/monitoring engineering with deep, recent, hands-on Grafana Cloud experience (not just OSS Grafana).
Production expertise across the full Grafana stack: Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, and Unified Alerting.
Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch.
OpenTelemetry practitioner - OTLP collectors, SDK/agent instrumentation for at least three of Go, Python .
eBPF-based auto-instrumentation experience with Beyla (or equivalent - Pixie, Cilium Tetragon) in a production context.
Experience integrating Grafana alerts into ServiceNow Event Management (native inbound integration, not webhook-only patterns); familiarity with ServiceNow ITOM AIOps event correlation and CMDB CI attachment.
Multi-environment hosting fluency - on-prem, AWS, Azure - and Linux/Windows host agent deployment at scale.
Dashboard-as-code and GitOps patterns (Grafana provisioning, Terraform provider, or Grizzly).
Excellent written communication - solution architecture documents, runbooks, and stakeholder-facing status reporting.
Nice to Have
Grafana Certified Professional or equivalent vendor certification.
Prior experience in a regulated utility, energy, or critical-infrastructure environment.
Familiarity with SolarWinds and Uptrends (sufficient to design clean boundaries with retained tooling, not to administer them).
Experience with ServiceNow CSDM and Service Mapping governance.
Exposure to FinOps for observability - cardinality control, log volume management, and retention tuning in Mimir/Loki.
Out of Scope for This Role
Server health and network monitoring (owned by SolarWinds).
URL/synthetic endpoint monitoring (owned by Uptrends).
ServiceNow ITSM workflow ownership - incident lifecycle remains with the client's ITSM/ITOM team; this role designs the integration, not the downstream process.
About Company
Why Reqroute? Our market niche is social media recruiting, and we effectively use social media platforms to reach out to a pool of active/passive candida ...