Senior Monitoring and Observability Lead 17+ years of experience
New York City, NY - USA
Job Summary
Job Title: Senior Monitoring and Observability Lead (17+ years of experience)
Location: New York, NY (Hybrid)
We are currently seeking candidates who meet the following qualifications:
Mandatory Qualifications
Enterprise Platform Evaluation & Implementation: Ability to evaluate tools such as Datadog, Splunk, Dynatrace, and SolarWinds; define selection criteria; and deliver a hands-on implementation plan and migration approach.
Telemetry Fundamentals: Strong understanding of logs/metrics/traces, event correlation, time-series data, and dashboard construction; familiarity with modern instrumentation patterns (OpenTelemetry preferred).
Infrastructure & Network Monitoring: Practical knowledge of SNMP, syslog, WMI, APIs, and agent-based data collection; comfort monitoring WAN/LAN/Wi-Fi performance, firewall/load balancer signals, and critical service dependencies.
Cloud Monitoring: Experience monitoring workloads and services in at least one major cloud (Azure/AWS/GCP), including identity, networking, and compute telemetry.
ITSM / Workflow Integration: Experience integrating monitoring with ticketing, routing, escalation, and knowledge workflows; ability to design severity and ownership models.
Documentation & Governance: Ability to write clear technical documentation, standards, and runbooks suitable for institutional and audit needs.
AIOps capabilities such as anomaly detection, dynamic baselining, event deduplication, correlation, and predictive insights.
Service topology mapping, dependency analysis, and service health models (SLIs/SLOs preferred).
Datadog, Splunk Observability, Dynatrace, SolarWinds, or comparable enterprise observability platforms.
Centralized logging and analytics approaches; understanding of retention, indexing/cost management, and governance.
Windows/Linux monitoring, virtualization platforms (VMware/Hyper-V), storage and backup monitoring, and network performance and configuration monitoring.
Operational alignment with CIS Benchmarks and secure monitoring practices (least privilege, secrets handling, encryption in transit, RBAC, auditability).
Relevant certifications (preferred, not required): ITIL Foundation, Security, cloud certifications (Azure/AWS/GCP), vendor observability certifications.
Experience producing executive dashboards and institutional KPI reporting (availability, performance, incident trends, capacity, risk posture).
Ability to analyze complex systems, identify root causes, and implement durable fixes.
Ability to communicate clearly with both technical and administrative audiences.
Strong organizational skills and ability to prioritize competing needs.
Service-oriented mindset aligned to the institution's mission and stakeholder support expectations.
Experience with Cisco enterprise operations tooling and integration, such as TACACS/RADIUS, SSO, certificate lifecycle, device compliance/drift detection, and automated configuration deployment workflows.
Familiarity with campus-scale operational needs (change windows tied to academic schedules, distributed support models, and stakeholder communication).
Duties/Responsibilities:
Provide support for SolarWinds alerting through current integrations, implement upgrades and enhancements, and enable features.
Design and implement an end-to-end observability approach spanning metrics, logs, traces, and events across on-prem and cloud environments.
Lead hands-on evaluation and implementation efforts for enterprise platforms including Datadog, Splunk Observability, Dynatrace, and SolarWinds, aligning tool capabilities to institutional requirements (availability, performance, security, scalability, cost).
Build and maintain telemetry collection standards (agent-based and agentless), tagging/metadata conventions, and service dependency views to improve root-cause isolation and service health reporting.
Establish durable operating practices for instrumentation onboarding, configuration management, lifecycle upgrades, and platform reliability.
Implement alerting strategies that prioritize actionable notifications, reduce noise, and improve time-to-detect (MTTD) and time-to-resolve (MTTR).
Develop and tune thresholds, dynamic baselines, anomaly detection, and intelligent event correlation (AIOps) to support 24x7 service reliability.
Support other infrastructure teams in creating runbooks, escalation standards, and response procedures.
Role may require occasional support to fix issues hampering the alerting and monitoring systems.
Contribute to post-incident reviews with measurable improvement outcomes such as alert tuning, automation, capacity adjustments, and resilience enhancements.
Build automation using APIs and scripting (PowerShell/Python) to standardize onboarding, reduce repetitive operations, and support self-service dashboards for campus IT teams.
Integrate monitoring and alerting with enterprise workflows such as ITSM ticketing and routing through ServiceNow.
Implement observability-as-code practices where feasible for repeatable deployment, configuration drift reduction, and consistent governance.
Partner with CUNY Infrastructure and Security teams to strengthen configuration practices aligned to CIS Benchmarks and other institutional hardening standards.
Build and maintain executive dashboards and reporting that highlight configuration drift, operational risks, and compliance posture relevant to servers, endpoints, network devices, and cloud resources.
Ensure observability agents, collectors, and integrations follow least-privilege access, secure credential handling, and approved data-handling practices.
Translate technical telemetry into practical insights for infrastructure teams and leadership (service health, risk trends, capacity indicators, etc.).
Collaborate with application owners and campus IT teams to improve visibility into service dependencies and user-impacting issues.
Provide hands-on systems administration for campus and data center network management platforms including Cisco Catalyst switching environments and Cisco Nexus Dashboard.
Implement and maintain configuration management practices: backups, version control, golden configurations, drift detection, and standardized deployment patterns for Catalyst and Nexus environments.
Enable observability outcomes by integrating network telemetry with the enterprise monitoring/observability platform(s) (e.g., Datadog, Splunk Observability, Dynatrace, SolarWinds), including SNMP polling/traps, syslog, NetFlow/IPFIX (where applicable), and streaming telemetry.
Normalize naming/tagging conventions for campus and data center devices to support accurate service maps, dashboards, and incident triage.
Support high availability and resilience by managing platform health, capacity planning, backups/restore testing, and continuity procedures for infrastructure management, monitoring, alerting, and observability services.
Administer lifecycle operations for network infrastructure and management tooling, including software/firmware upgrades, image standardization, patching, and coordinated maintenance windows aligned with institutional change management practices.
Produce clear documentation and training materials to support adoption and consistent operational practices.
If you meet these qualifications, please submit your application via the link provided on LinkedIn.
Kindly do not call the general line to submit your application.