Site Reliability Engineer, Associate

BlackRock

Job Location:

Edinburgh - UK

Monthly Salary: Not Disclosed
Posted on: Yesterday
Vacancies: 1 Vacancy

Job Summary

About this role

We're looking for an SRE with strong Kafka experience and a deep understanding of SRE best practices. You'll combine hands-on technical improvements with the ability to delegate work effectively to EventBus developers.

You'll collaborate closely with the EventBus, Kafka, Telemetry, and Incident Response teams while also working independently to improve monitoring, reduce noise, strengthen alerting, and track remediation progress.

This role sits at the centre of a global platform used by hundreds of developers and joins a fast-growing, experienced SRE group based in Edinburgh.

About the Team

The Aladdin EventBus is built on Kafka and enables teams to publish and subscribe to distributed events in near real time. As part of the Aladdin Graph group, a core Platform Engineering function, the EventBus team supports developers across the firm in designing, building, and operating event-driven and API-based systems.

EventBus is now a critical dependency for key applications, including our release system and API infrastructure. This drives a high bar for availability, incident responsiveness, and operational excellence. The SRE function supports this by improving observability, streamlining incident processes, and identifying gaps that meaningfully improve platform reliability.

Key Responsibilities:

As the SRE for EventBus, you will drive stability, resiliency, and observability through:

  • Staying informed on all EventBus incidents, including impact, root cause, detection, and ongoing remediation
  • Responding to incidents calmly and efficiently, communicating clearly with reporters and partner teams, and recommending remediations based on urgency and impact
  • Proposing improvements informed by prior incidents, potential risks, and industry standards (e.g. new metrics, SLOs, fallback mechanisms)
  • Leading incident retrospectives and sharing insights with the wider team
  • Creating and distributing postmortems for high-impact operational events
  • Collaborating with developers to write, maintain, and promote runbooks and playbooks
  • Improving alert quality and reducing alert fatigue by tuning signal-to-noise ratios
  • Designing and implementing automated recovery solutions for known issues
  • Building a roadmap toward 24/7 availability, rapid failover recovery, self-detection, and automated resolution of common issues
  • Helping EventBus users diagnose issues with their own producers and consumers

Requirements

  • 3 years in an SRE role, including experience with defining and managing SLOs
  • Strong understanding of SRE principles (Golden Signals, error budgets, synthetic monitoring, signal-to-noise optimisation)
  • Extensive hands-on experience with Kafka
  • Experience using monitoring tools (Grafana and Splunk preferred), including building dashboards, alerts, and reports

Suggested Requirements

  • Java Developer Experience: Experience with Java or another object-oriented language
  • CI/CD & Release Management: Experience managing pipelines using Azure DevOps or other Git-based tools
  • Cloud Experience: Practical experience with at least one public cloud provider, preferably Azure or AWS
  • Agile Development: Familiarity with agile ways of working, sprint ceremonies, and backlog planning
  • Scripting & Automation: Proficiency in Python or Golang for automating operational tasks
  • Monitoring & Observability: Strong understanding of logging, monitoring, and observability practices, including writing integration scripts
  • Collaboration & Communication: Strong cross-team collaboration skills and excellent written and verbal communication

Our benefits

To help you stay energized, engaged, and inspired, we offer a wide range of employee benefits, including: retirement investment and tools designed to help you in building a sound financial future; access to education reimbursement; comprehensive resources to support your physical health and emotional well-being; family support programs; and Flexible Time Off (FTO) so you can relax, recharge, and be there for the people you care about.

Our hybrid work model

BlackRock's hybrid work model is designed to enable a culture of collaboration and apprenticeship that enriches the experience of our employees while supporting flexibility for all. Employees are currently required to work at least 4 days in the office per week, with the flexibility to work from home 1 day a week. Some business groups may require more time in the office due to their roles and responsibilities. We remain focused on increasing the impactful moments that arise when we work together in person, aligned with our commitment to performance and innovation. As a new joiner, you can count on this hybrid model to accelerate your learning and onboarding experience here at BlackRock.

About BlackRock

At BlackRock, we are all connected by one mission: to help more and more people experience financial well-being. Our clients, and the people they serve, are saving for retirement, paying for their children's educations, buying homes, and starting businesses. Their investments also help to strengthen the global economy: support businesses small and large; finance infrastructure projects that connect and power cities; and facilitate innovations that drive progress.

This mission would not be possible without our smartest investment: the one we make in our employees. It's why we're dedicated to creating an environment where our colleagues feel welcomed, valued, and supported, with networks, benefits, and development opportunities to help them thrive.

For additional information on BlackRock, see @blackrock on Twitter and the BlackRock page on LinkedIn.

BlackRock is proud to be an Equal Opportunity Employer. We evaluate qualified applicants without regard to age, disability, race, religion, sex, sexual orientation, and other protected characteristics at law.


Required Experience:

IC

Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting

About Company

BlackRock is one of the world’s preeminent asset management firms and a premier provider of investment management.