Lead Infrastructure Engineer - Infrastructure Monitoring
Wilmington, DE - USA
Job Summary
We have an exciting opportunity for you to collaborate with passionate professionals, solve complex problems, and grow your career in a supportive, innovative environment.
As a Lead Infrastructure Engineer at JPMorgan Chase within Corporate Technology's Enterprise Observability Platforms, you will help build and operate a strategic, market-leading Infrastructure Monitoring platform that strengthens critical service resilience and delivers trusted operational insights. You will be a hands-on technical contributor on a high-performing agile team building secure, stable, and scalable observability solutions: turning telemetry into actionable insights, modernizing event-to-incident workflows, and enabling automation and AIOps-driven reliability improvements aligned to the firm's business objectives.
Job responsibilities
- Engineer, operate, and continuously improve the firm's Infrastructure Monitoring platforms, ensuring availability, performance, scalability, and security.
- Build and run enterprise-grade Infrastructure Monitoring capabilities across Linux, Windows, and complex network estates, including platform-level onboarding and lifecycle management.
- Design and implement platform services, integrations, and telemetry collection across metrics, logs, and events, including OpenTelemetry collection patterns where applicable.
- Develop and maintain standardized onboarding patterns (agents/collectors, configurations, dashboards, alert policies) to accelerate safe adoption at scale.
- Improve monitoring signal quality and usability through baselining, threshold strategy, noise reduction, enrichment, and topology/context alignment.
- Develop secure, high-quality automation and production code; review, debug, and improve code/configuration written by others.
- Automate platform operations and reduce toil through scripting and CI/CD-driven configuration management; implement infrastructure-as-code deployment patterns.
- Manage and maintain production health for the monitoring platform: lead triage, perform RCA, and deliver preventative engineering and resilience improvements.
- Partner with infrastructure, application, and SRE teams to align platform capabilities to SLIs/SLOs, operational readiness, and continuous improvement goals.
- Contribute to a culture of diversity, opportunity, inclusion, and respect.
Required qualifications, capabilities, and skills
- Formal training or certification on infrastructure engineering concepts and 5 years of applied experience.
- Proficiency with enterprise operating systems (Linux and/or Windows), including administration, troubleshooting, performance analysis, and operational best practices within regulated production environments.
- Proven hands-on experience delivering and operating enterprise-scale Infrastructure Monitoring solutions across Linux, Windows, and/or network estates.
- Solid understanding and hands-on implementation of observability and telemetry concepts, including metrics, logs, and events, with experience using OpenTelemetry collection patterns and integrating telemetry into downstream components.
- Proficiency in automation and engineering practices, including scripting and development with Python, Ansible, and PowerShell/Bash, and applying CI/CD-driven workflows for controlled, secure, and repeatable change management.
- Well-rounded experience in infrastructure across hardware platforms, operating systems, networking, storage, and databases (MS SQL Server, Oracle, Cassandra), including common deployment patterns, integration architectures, scaling and resiliency considerations, and performance assessment.
- Experience implementing Infrastructure-as-Code (IaC) and configuration management practices using tools such as Terraform, enabling standardized provisioning and scalable, repeatable deployments.
- Hands-on experience operating in hybrid infrastructure environments, including enterprise on-prem platforms and public/private cloud, with familiarity supporting and migrating monitoring capabilities across cloud boundaries.
- Demonstrated ability to improve monitoring signal quality through baselining, threshold strategy, noise reduction, enrichment, and topology/context alignment, supporting reliable event-to-incident workflows and operational insights.
- Experience developing, reviewing, debugging, and maintaining secure, high-quality production code and platform configurations, including automation supporting monitoring platforms and platform operations.
Preferred qualifications, capabilities, and skills
- Hands-on experience operating one or more enterprise monitoring platforms, such as SCOM, Tivoli, SMARTS, IBM Instana, DX NetOps, ITNM, or Netcool Suite.
- Experience with modern observability ecosystems such as Splunk, Dynatrace, Grafana, and Prometheus, and with interoperability patterns for telemetry integration, routing, and visualization.
- Experience with Kubernetes (e.g., EKS) for container orchestration and operations.
- Experience with topology-driven monitoring and correlation approaches for large-scale infrastructure environments.
- Knowledge of Event Management and AIOps workflows (noise reduction, anomaly detection, probable cause analysis, guided remediation) with appropriate controls.
Required Experience:
IC
About Company
JPMorganChase, one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world’s most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans ov ...