We are seeking a Principal Observability Architect to lead the strategic architecture, evolution, and operationalization of a modern, multi-tenant Observability Platform-as-a-Service (OPaaS) tailored for a hybrid on-prem and cloud-native SaaS product.
You will architect a cloud-agnostic, federated observability platform that supports real-time monitoring, advanced telemetry pipelines, and AI-powered insights to ensure platform reliability, developer productivity, and exceptional customer experiences. This role combines deep technical leadership with a strong focus on developer enablement, platform resiliency, and data governance.
What you get to do in this role:
Platform Architecture & Strategy
- Lead architecture and roadmap for a multi-region, multi-cloud, multi-tenant observability platform, scalable across diverse customer environments and service boundaries.
- Architect near real-time telemetry ingestion pipelines with low-latency guarantees (seconds) using a mix of streaming and batch processing technologies.
- Define observability blueprints, including telemetry SLAs, data contracts, tenant data isolation, and cost-aware retention strategies for high-cardinality data.
- Ensure observability systems are cloud-native and container-aware, supporting environments built on Kubernetes, service meshes, and serverless components.
Real-Time Monitoring & Detection
- Design and implement real-time metrics, logs, traces, and event pipelines with technologies such as:
  - VictoriaMetrics, Prometheus, Grafana, Alertmanager
  - Cribl Stream and Edge for dynamic routing and filtering
  - VictoriaLogs for structured log analysis
- Embed real-time anomaly detection and signal correlation with context-aware alerting to reduce noise and MTTR.
- Integrate with alerting and incident response tools (PagerDuty, Slack, ServiceNow) for automated incident routing and contextual enrichment.
- Ensure observability of synthetic probes, end-user transactions, and critical SLOs with per-tenant granularity.
Instrumentation, Developer Enablement & CI/CD Integration
- Standardize OpenTelemetry instrumentation across all services with prebuilt SDKs, language libraries, and semantic conventions.
- Architect OpenTelemetry deployment patterns (agent-based, sidecar, collector pipelines) with support for Kubernetes, Lambda, and edge environments.
- Embed observability validation gates into CI/CD workflows (e.g., GitHub Actions, GitLab CI) to enforce telemetry compliance before production rollout.
- Provide self-service tools, templates, and training to enable developer teams to adopt observability by default.
AI for Observability & Productivity
- Leverage AI/ML for:
  - Real-time anomaly detection and noise suppression
  - Predictive incident detection and impact forecasting
  - Auto-summarization of alert storms and telemetry bursts
  - Multi-tenant root cause and blast radius correlation
- Build or integrate LLM-powered tools that support:
  - Natural language querying of live telemetry
  - AI-assisted debugging and dashboard generation
  - Generative runbooks and incident summaries
Data Platform Architecture
- Architect hot and cold telemetry storage pipelines using:
  - VictoriaMetrics and Cribl for hot-path observability
  - Long-term retention in object storage (e.g., S3, GCS) using open formats (Parquet, JSON)
  - Federated querying engines like Trino for historical and cross-service analytics
- Implement cost-aware ETL strategies balancing real-time visibility with storage and ingestion optimization.
- Incorporate data governance, PII handling, and regional data compliance (e.g., GDPR, SOC2) into telemetry architecture.
SaaS Operations & ITSM Integration
- Integrate observability into ITSM and incident response systems (e.g., ServiceNow, Jira):
  - Auto-create incidents enriched with correlated traces, logs, and metrics
  - Provide real-time telemetry context in change and problem management flows
- Deliver customer-facing health dashboards, SLA monitoring, and per-tenant observability insights to support operational excellence and transparency.
Technical Leadership
- Lead cross-functional collaboration with SRE, Platform, Security, and Engineering teams to evolve observability maturity.
- Define and document observability patterns, anti-patterns, and escalation workflows.
- Drive internal R&D around OpenTelemetry, AI in observability, high-cardinality telemetry, and eBPF-based observability tooling.
Qualifications:
To be successful in this role, you have:
- Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
- 10 years in DevOps, SRE, or Observability roles, including 5 years in architecture or platform engineering.
- Proven experience designing and operating near real-time observability systems in global-scale SaaS environments.
- Deep expertise in OpenTelemetry (including collector deployment, semantic conventions, sampling strategies).
- Experience integrating observability in Kubernetes, microservices, and serverless ecosystems.
- Hands-on with telemetry data pipelines using Cribl, Prometheus/VictoriaMetrics, and log/trace platforms.
- Experience embedding telemetry validation in CI/CD workflows.
- Familiarity with AI/ML for observability (anomaly detection, summarization, impact correlation).
- Working knowledge of data privacy, retention, and compliance practices in observability.
Nice to Have:
- Experience with Trino, S3 data lakes, and long-term observability analysis.
- Experience building customer-facing observability features (dashboards, SLAs, health status pages).
- Contributions to open-source observability tools or standards.
- Knowledge of, or hands-on experience with, agentic AI systems to drive autonomous remediation, telemetry analysis, or incident response.
- Relevant certifications (e.g., AWS, GCP, Azure, OpenTelemetry, Observability Practitioner).
GCS-23
Additional Information:
Work Personas
We approach our distributed world of work with flexibility and trust. Work personas (flexible, remote, or required in office) are categories that are assigned to ServiceNow employees depending on the nature of their work and their assigned work location. Learn more here. To determine eligibility for a work persona, ServiceNow may confirm the distance between your primary residence and the closest ServiceNow office using a third-party service.
Equal Opportunity Employer
ServiceNow is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, creed, religion, sex, sexual orientation, national origin or nationality, ancestry, age, disability, gender identity or expression, marital status, veteran status, or any other category protected by law. In addition, all qualified applicants with arrest or conviction records will be considered for employment in accordance with legal requirements.
Accommodations
We strive to create an accessible and inclusive experience for all candidates. If you require a reasonable accommodation to complete any part of the application process, or are unable to use this online application and need an alternative method to apply, please contact for assistance.
Export Control Regulations
For positions requiring access to controlled technology subject to export control regulations, including the U.S. Export Administration Regulations (EAR), ServiceNow may be required to obtain export control approval from government authorities for certain individuals. All employment is contingent upon ServiceNow obtaining any export license or other approval that may be required by relevant export control authorities.
From Fortune. ©2025 Fortune Media IP Limited. All rights reserved. Used under license.
Remote Work:
Yes
Employment Type:
Full-time