Staff SRE Engineer

Miami, FL - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

About Us

Nu is one of the largest digital financial platforms in the world, with more than 127 million customers across Brazil, Mexico, and Colombia. Guided by our mission to fight complexity and empower people, we are redefining financial services in Latin America, and this is still just the beginning of the purple future we're building.

Listed on the New York Stock Exchange (NYSE: NU), we combine proprietary technology, data intelligence, and an efficient operating model to deliver financial products that are simple, accessible, and human.

Our impact has been recognized by global rankings such as Time 100 Companies, Fast Company's Most Innovative Companies, and Forbes World's Best Bank. Visit our institutional page https://

About the role

As a Staff Site Reliability Engineer for our Data Infra SRE team, you will be a strategic leader responsible for defining the future of reliability for our Data Platform. This role is pivotal in shaping the transition toward a Data Mesh architecture and executing the Archipelago evolution plan. Your primary goal is to ensure the scalability of our data infrastructure by moving beyond traditional SRE practices and investing heavily in intelligent automation. By leading the development of AI-driven reliability solutions, you will directly impact our ability to maintain high availability and performance across hundreds of business platforms and millions of global customers.

You'll be responsible for:

Defining Strategic Evolution: You will lead initiatives to refine the strategic direction of the SRE team, ensuring the Data Platform infrastructure supports the company's long-term decentralization goals and the Archipelago evolution plan.
Designing Architectural Leadership: You will provide expert guidance for the design, implementation, and maintenance of highly reliable, scalable, and performant data systems.
Pioneering AI-Driven Automation: You will champion the adoption of advanced automation frameworks, such as LangGraph and AI agents, to autonomously resolve data platform crashes and coordinate incident responses.
Implementing Proactive System Health: You will develop sophisticated anomaly detection and predictive analytics mechanisms to identify and prevent potential issues before they impact the business.
Establishing Incident Protocols: You will lead the refinement of incident response protocols and post-incident analysis to drive continuous improvement in platform stability.
Mentoring and Technical Culture: You will mentor other engineers, foster a culture of reliability engineering excellence, and take ownership of technical initiatives that eliminate toil and optimize resource utilization.

We are looking for a person who has:

Extensive Experience in SRE or Systems Engineering: A proven track record of leading complex technical initiatives and defining infrastructure strategies at a staff level or equivalent.
Proficiency in Functional Programming and Big Data: Solid experience with Clojure and Datomic for backend systems, alongside Scala and Spark for high-volume data processing.
Expertise in Cloud Infrastructure: Deep practical knowledge of managing mission-critical workloads on AWS using Kubernetes, Step Functions, Lambdas, and EC2.
Experience Building Automation from the Ground Up: A demonstrated ability to innovate and build automation frameworks in greenfield environments, with a focus on implementing AI agents for operational efficiency.
Advanced Knowledge of Reliability Practices: Experience defining and enforcing Service Level Objectives, managing system observability, and leading disaster recovery and capacity planning.
Strategic Problem-Solving: The ability to translate complex architectural challenges into scalable software solutions while managing cost, performance, and security best practices.

Our SRE team is formally responsible for:

Service Level Objectives (SLO) Management: Defining, monitoring, and enforcing SLOs for critical data platform services.
System Observability: Implementing and maintaining comprehensive monitoring, logging, and tracing solutions across the data platform.
Toil Reduction: Identifying and automating repetitive manual tasks to improve team efficiency and focus on strategic initiatives.
Disaster Recovery and Business Continuity: Developing and testing disaster recovery plans to ensure the resilience of the data platform.
Capacity Planning: Forecasting resource needs and planning for infrastructure scaling to meet anticipated demand.
Performance Engineering: Optimizing system performance and addressing bottlenecks to ensure efficient operation.
Security Best Practices: Implementing and advocating for security best practices within the data platform.
Platform APIs: Enabling alert management on virtually any service with simple interactions.

Benefits

Opportunity to earn equity at Nu
Medical Insurance
Dental and Vision Insurance
Life Insurance and AD&D
Extended maternity and paternity leaves
Nucleo - Our learning platform of courses
NuLanguage - Our language learning program
NuCare - Our mental health and wellness assistance program
401(k)
Savings Plans - Health Savings Account and Flexible Spending Account
Work-from-home Allowance
Relocation Assistance Package, if applicable

Role location

Palo Alto, United States
Miami, United States
Washington, DC, United States
Durham, United States

Work Model for this Role

Hybrid 2-3 times/week: Our hybrid work model brings us to the office at least twice a week, on strategic days designed to maximize team connection and collaboration. For more details visit

Experience:

Staff IC

About Company

Nubank

You, finally in control of your money. Full control of your credit card and your 100% digital account.
