Site Reliability Engineer Lead
Job Summary
Job Description:
At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day.
Being a Great Place to Work is core to how we drive Responsible Growth. This includes our commitment to being an inclusive workplace, attracting and developing exceptional talent, supporting our teammates' physical, emotional and financial wellness, recognizing and rewarding performance, and how we make an impact in the communities we serve.
Bank of America is committed to an in-office culture that has specific requirements for office-based attendance and allows for an appropriate level of flexibility for our teammates and businesses based on role-specific considerations.
At Bank of America, you can build a successful career with opportunities to learn, grow and make an impact. Join us!
Job Description:
This job is responsible for building and leading a team to deliver technology products and services that meet business outcomes. Key responsibilities include developing a technology strategy, ensuring technology solutions comply with applicable standards, promoting design, engineering and organizational practices, and advocating and advancing modern Agile solution delivery practices. Job expectations may include coaching, mentoring, providing feedback and hands-on career development, identifying emerging talent, fostering leadership skills and managing stakeholders.
Overview:
Seeking a seasoned Site Reliability Engineering (SRE) Leader to drive the reliability, scalability and performance of critical Infrastructure Automation platforms. This role will lead the design and implementation of SRE practices across a federated technology ecosystem, ensuring operational excellence through automation, observability and resilient architecture.
The ideal candidate will bring deep expertise in distributed systems, cloud-native infrastructure, SaaS application support and DevOps/SRE principles, along with strong leadership and collaboration skills to influence cross-functional engineering and production management teams and drive continuous improvement in service reliability.
Responsibilities:
SRE Strategy & Governance:
- Define and implement SRE frameworks, including SLIs/SLOs/SLAs, error budgets and incident response protocols.
- Establish governance models for reliability engineering across distributed teams.
- Champion a culture of observability, proactive monitoring and continuous feedback loops.
Reactive & Proactive Problem Management:
- Lead root cause analysis (RCA) and post-incident reviews to identify systemic issues and prevent recurrence.
- Implement proactive problem detection using telemetry, anomaly detection and trend analysis.
- Collaborate with engineering and operations teams to eliminate toil and reduce incident frequency and impact.
Capacity & Performance Management:
- Develop and maintain capacity models to ensure systems scale efficiently with business demand.
- Monitor performance trends and lead optimization efforts across infrastructure and applications.
- Partner with finance and engineering teams to align capacity planning with cost and growth objectives.
Platform Reliability & Automation:
- Drive automation of operational tasks, including deployments, scaling and recovery.
- Integrate reliability tooling with CI/CD pipelines, ITSM platforms (e.g., ServiceNow) and observability systems.
Incident Management & Operational Excellence:
- Oversee major incident response, escalation and communication processes.
- Develop and maintain runbooks, playbooks and escalation protocols.
- Drive continuous improvement through blameless retrospectives and operational reviews.
Technical Leadership:
- Serve as a senior technical advisor and thought leader in SRE and platform engineering.
- Mentor and guide SRE teams and partner with engineering leaders across the enterprise.
- Provide input on staffing, tooling strategy and budget planning for reliability initiatives.
Managerial Responsibilities:
This position may also have responsibilities for managing associates. At Bank of America, all managers at this level demonstrate the following responsibilities in addition to those specific to the role listed above.
- Opportunity & Inclusion Champion: Models an inclusive environment for employees and clients aligned to company Great Place to Work goals.
- Manager of Process & Data: Demonstrates deep process knowledge, operational excellence and innovation through a focus on simplicity, data-based decision making and continuous improvement.
- Enterprise Advocate & Communicator: Communicates enterprise decisions, purpose and results, and connects them to team strategy, priorities and contributions.
- Risk Manager: Ensures proper risk discipline, controls and culture are in place to identify, escalate and debate issues.
- People Manager & Coach: Provides inspection, coaching and feedback to motivate, differentiate and improve performance.
- Financial Steward: Actively manages expenses and budgets in alignment with objectives, making sound financial decisions.
- Enterprise Talent Leader: Assesses talent and builds bench strength for roles across the organization.
- Driver of Business Outcomes: Delivers results by effectively prioritizing, inspecting and appropriately delegating team work.
Required Qualifications:
- 10 years of experience in systems engineering, DevOps or SRE roles in large-scale environments.
- Deep understanding of Linux/Unix and Windows systems, networking and distributed computing.
- Proven experience with observability stacks (e.g., Dynatrace, Grafana, Splunk, OpenTelemetry).
- Expertise in infrastructure-as-code and automation tools (e.g., Terraform, Ansible, Python).
- Strong knowledge of cloud platforms and container orchestration (Kubernetes).
- Demonstrated success in leading incident response and driving systemic improvements.
- Experience with capacity planning, performance tuning and cost optimization.
- Excellent communication and stakeholder management skills, including executive engagement.
Desired Qualifications:
- Experience with ITIL/ITSM processes and integration with platforms like ServiceNow.
- Familiarity with security and compliance in regulated industries (e.g., financial services).
- Background in performance engineering and infrastructure analytics.
- Experience developing dashboards and metrics for operational health and reliability.
Skills:
- Influence
- Risk Management
- Solution Design
- Stakeholder Management
- Technical Strategy Development
- Analytical Thinking
- Application Development
- Collaboration
- Result Orientation
- Solution Delivery Process
- Agile Practices
- Architecture
- Automation
- Data Management
- DevOps Practices
Shift:
1st shift (United States of America)
Hours Per Week:
40
Required Experience:
IC
About Company
What would you like the power to do? At Bank of America, our purpose is to help make financial lives better through the power of every connection.