Position: Lead Systems Engineer
Location: Washington, DC (Hybrid)
Duration: Long term
Job Description:
- We are seeking a seasoned Lead Systems Engineer with deep expertise in Datadog, AWS, and ServiceNow integration.
- In this leadership role, you will oversee the design, implementation, and maintenance of comprehensive monitoring, observability, and incident management solutions for cloud-based infrastructure and applications.
- You will play a key role in guiding the team to ensure operational excellence, system reliability, and seamless collaboration across IT and engineering teams.
Responsibilities:
- Lead the architecture, design, and implementation of end-to-end monitoring solutions using Datadog, ensuring high availability and performance of cloud-based services.
- Oversee the deployment and management of AWS resources (EC2, RDS, Lambda, ECS/EKS, S3, etc.), ensuring adherence to best practices for scalability, security, and cost optimization.
- Define monitoring strategies and best practices, including Datadog dashboards, monitors, alerts, and custom metrics, for comprehensive observability.
- Architect and manage the integration of Datadog with ServiceNow to automate incident management workflows, event correlation, and CMDB synchronization.
- Provide technical leadership and mentorship to junior engineers on best practices for monitoring, logging, and observability.
- Collaborate with cross-functional teams to integrate monitoring and logging into CI/CD pipelines and cloud infrastructure.
- Drive continuous improvement in system reliability, including SLO/SLI definitions, synthetic monitoring, and anomaly detection.
- Contribute to and enforce Infrastructure as Code (IaC) standards using Terraform, CloudFormation, or similar tools.
- Participate in high-severity incident management, root cause analysis, and the implementation of corrective actions to prevent future occurrences.
Required Skills:
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
- 5 years of experience with AWS cloud services, including deployment, management, and optimization of cloud infrastructure.
- 3 years of hands-on experience with Datadog, including complex dashboards, integrations, and custom metrics.
- 2 years of experience integrating Datadog with ServiceNow, including incident management workflows, event management, and CMDB integration.
- Demonstrated experience leading teams or projects in a cloud operations or DevOps environment.
- Strong proficiency in scripting and automation (Python, Bash, or similar).
- Solid understanding of networking, security best practices, distributed systems, and troubleshooting complex cloud architectures.
Preferred Skills (Nice to Have):
- Experience with Infrastructure as Code (Terraform, CloudFormation).
- AWS certifications (e.g., AWS Certified Solutions Architect, DevOps Engineer).
- Experience with Kubernetes monitoring and log aggregation solutions (Fluentd, ELK stack).
- Familiarity with other observability tools like Prometheus or Grafana.
- ServiceNow certifications or experience with ServiceNow ITOM modules (Discovery, Event Management, CMDB).
- Excellent leadership and mentorship skills, with experience in cross-functional collaboration.
- Strong leadership and communication skills to effectively guide a team.
- Excellent problem-solving skills, with the ability to handle high-pressure situations.
- Organizational and prioritization skills to manage multiple tasks and projects effectively.
Thanks & Regards
--
LAXMAN
Team Lead - Talent Acquisition
KMM Technologies Inc.
CMMI Level 2, ISO 9001, ISO 20000, ISO 27000 Certified