Location: Norfolk, VA / Richmond, VA / Atlanta, GA / Texas (Hybrid)
Job Description
Monitoring Strategy & Implementation
Design and deploy Datadog dashboards for both application and database domains.
Configure alerting logic using Datadog monitors, composite alerts, and anomaly detection.
Develop custom scripts (Python, Bash, PowerShell) to support dynamic alerting and data enrichment.
Application Monitoring
Integrate Datadog APM with microservices ( ) and front-end layers.
Monitor service health, latency, error rates, and throughput.
Enable distributed tracing and log correlation for root cause analysis.
Configure synthetic tests and real-user monitoring (RUM) for critical endpoints.
Database Monitoring (Oracle & SQL Server)
Set up Datadog integrations for Oracle and SQL Server to track:
Query performance, blocking sessions, and deadlocks.
Connection pool usage, replication lag, and backup status.
Tablespace utilization, I/O latency, and cache hit ratios.
Automate alerting for threshold breaches and unusual patterns using scripting.
Scripting & Automation
Build reusable scripts to:
Generate dynamic dashboards based on metadata.
Auto-adjust alert thresholds based on historical trends.
Integrate Datadog with ServiceNow for incident creation.
Maintain version-controlled script repositories and CI/CD pipelines for observability assets.
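As an illustration of the threshold auto-adjustment item above, the following is a minimal Python sketch, not a prescribed implementation. It assumes historical metric values have already been fetched, derives a critical threshold as the mean plus k sample standard deviations, and builds a dictionary shaped like a Datadog monitor update (a `query` string plus `options.thresholds.critical`). The metric name `custom.db.query_latency_ms`, the `env:prod` scope, and both function names are hypothetical.

```python
from statistics import mean, stdev


def suggest_threshold(history, k=3.0, floor=None):
    """Return mean + k * sample standard deviation of the history.

    `floor` is an optional lower bound so a very quiet history
    does not produce an overly sensitive threshold.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    threshold = mean(history) + k * stdev(history)
    if floor is not None:
        threshold = max(threshold, floor)
    return round(threshold, 2)


def build_monitor_update(metric, scope, history, k=3.0):
    """Build a monitor-update payload using the suggested threshold.

    `metric` and `scope` are placeholders for illustration; the payload
    would still need to be sent to Datadog by a separate API call.
    """
    threshold = suggest_threshold(history, k)
    return {
        "query": f"avg(last_5m):avg:{metric}{{{scope}}} > {threshold}",
        "options": {"thresholds": {"critical": threshold}},
    }


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds.
    samples = [100, 110, 90, 105, 95]
    print(build_monitor_update("custom.db.query_latency_ms", "env:prod", samples))
```

Keeping the threshold calculation separate from the payload builder lets the statistical rule be swapped (for example, for a percentile) without touching the integration code.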
Stakeholder Collaboration
Engage with application and database teams to gather monitoring requirements.
Conduct workshops and knowledge transfer (KT) sessions for support teams on dashboard usage and alert triage.
Partner with SREs to align monitoring with SLIs/SLOs and reliability goals.
Reporting & Governance
Generate weekly/monthly reliability reports using Datadog analytics.
Maintain SOPs, runbooks, and RCA documentation.
Ensure compliance with enterprise monitoring standards and audit readiness.
Skill Set
5 years of experience in cloud monitoring and reliability engineering.
Proven expertise in Datadog (APM, Infrastructure, Logs, Monitors, Dashboards).
Hands-on experience with Oracle and SQL Server monitoring.
Strong SQL skills, including PL/SQL for Oracle and T-SQL for SQL Server.
Proficiency in scripting (Python, Bash, PowerShell) for automation.
Experience in configuring complex alerting logic using Datadog monitors.
Understanding of application architectures ( ).
Familiarity with ITSM tools (ServiceNow), CI/CD (Jenkins, Git), and cloud platforms (Azure, AWS).
Excellent communication and documentation capabilities.
Nice-to-Have Skills:
Datadog certification.
Knowledge of healthcare domain and compliance standards.
Experience with CI/CD tools and infrastructure as code (Terraform, Ansible).