Site Reliability Engineer, Retail & Banking Technology domain, ING Hubs Romania

Bucharest - Romania

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Discover ING Hubs Romania

ING Hubs Romania offers 130 services in software development, data management, non-financial risk & compliance, audit, and retail operations to 24 ING units worldwide, with the help of over 2000 high-performing engineers, risk, and operations professionals.

We started out in 2015 as ING's software development hub, then steadily expanded our range to include more services. We provide borderless services with bank-wide capabilities and operate from two locations: Bucharest and Cluj-Napoca.

Our tech capabilities remain the core of our business, with more than 1800 colleagues active in Data and Analytics Tech, Tech Foundation and Channels, Retail, Core Banking and Architecture, and Global Products and Technology Services.

We enjoy a flexible way of working and a highly collaborative environment where fair and constructive feedback is encouraged.

We are guided and rewarded by a shared desire to make the world a better place, one innovative solution at a time. Our colleagues make it their job to do impactful things, and they love doing it in good company.

The Mission

ING's ambition is to be the number one digital banking brand in Europe, offering customers everywhere the same empowering, personalized, and differentiating experience. A collaborative, communicative Site Reliability Engineer will change the way we're working.

R&BT SRE team

The R&BT Site Reliability Engineering (SRE) team is a multidisciplinary team of senior engineers with proven track records in development and operations across applications and infrastructure. The primary goal is to continuously and structurally improve the reliability and maintainability of the IT environments involved with the R&BT Platforms delivered and managed from different (international) ING domains.

Objective: Site Reliability Engineering (SRE) enhances the reliability and scalability of BTP platform services through collaborative efforts, prioritizing availability, performance, efficiency, and observability.
Measurement: SRE targets increased MTBF, decreased MTTR, and minimized operational toil.
Approach: This is facilitated by automation, standardized procedures, and the adoption of SRE best practices.
Cultivate a Reliability Mindset: The aim is to foster a culture of reliability throughout the BTP organization, encouraging proactive behaviours and attitudes.

Your day to day

Ensure Service Level Objective (SLO) levels are set and met;
Optimize our Observability tooling like Grafana dashboards;
Report on GSRE targets and KPIs;
Do yearly Well Architected Reviews and observability Assessments for all critical components;
Drive an Always Available mindset and behaviour within the R&BT organization. Be able to recognize shortcomings in knowledge and expertise and deliver the necessary resources, skills, guidance, and training to DevOps teams where needed;
Define and enhance standards for logging, monitoring, and alerting, and actively monitor end-to-end platform performance through white-box and black-box monitoring tools;
Improve incident response practices and be actively engaged in the response to escalated and critical incidents. On-call duty is currently not part of the job but should not be an objection if and when required;
Participate in Root Cause Analysis. Prioritize and implement the RCA recommendations through improvement plans with the responsible Squads / DevOps teams;
Track and trace actions out of post mortems and Emirs;
Drive continuous improvement on all services in the R&BT Platforms through analysis of the current level of service, functional and technical setup, code, dev/ops practices, and the underlying causes of incidents, underperformance, etc.;
Roll out new resilience features throughout the organization;
Set up and maintain automatic reporting and feedback loops;
Contribute to automating Build, Test, and Deployment practices through the CI/CD pipeline;
Contribute to tuning application resources and updating highly available deployment patterns of (mostly) container and VM based environments;
Initiate and contribute to new SRE initiatives like AI Ops, Chaos Engineering, migrations to Public Cloud, and Error Budgeting;
Participate in and initiate experiments with new tools and concepts, and evaluate their value against set goals.

What you'll bring to the team

Operations expert: 4 years of experience working with Agile DevOps principles.

Solid understanding of how technology setup and ITSM processes relate to service level objectives like Availability (time-based, successful call rate, response times), MTTR, and MTBF.

Good understanding of microservices architecture and related high availability / resilience patterns, and experience building systems with multiple layers of redundancy to withstand failures in software, hardware, and network infrastructure.

Proven experience:

working as a Site Reliability Engineer or DevOps engineer.
scripting in at least one of the following: Ruby, Python, Bash, PowerShell.
setting up Build and Deployment pipelines in Azure DevOps (ADO).
setting up white-box monitoring and formulating meaningful metrics for monitoring and reporting: Grafana, TraceING.
eliminating toil through automation and process optimization.
Able to coordinate/lead incident response and post mortem / root cause analysis activities.
Understanding of IT Service Management processes (ING Global Way of Working) and the way they relate to SRE objectives.
Good understanding of Public Cloud concepts.

Prior work experience with tools:

CI/CD Pipeline: OnePipeline / Azure DevOps / Kingsroad.
Cloud computing and container orchestration: Linux VMs and Kubernetes container platforms. Knowledge of OpenShift, AKS, and related certifications is a plus.
Touchpoint service mesh and SDK/Merak.
Logging/monitoring/alerting: Kafka, ELK, Prometheus, and IAT. Experience with black-box monitoring tools like Rigor/Splunk and AI Ops tools like Loom is a plus.
Backlog management: Azure Boards
ITSM: SNOW

The ideal candidate has:

A Bachelor's or Master's degree in computer science or a related field.
Experience coaching and training DevOps engineers on technical subjects.
Previous experience as a consumer of R&BT Platforms, preferably Touchpoint Platform.
Understanding of the ING application risk journey.

If you want to deep dive into the processing of personal data conducted by ING Hubs Romania during the recruitment process and your rights related to it, read the privacy notices on our website (make sure to scroll until you reach the Data Protection section / Candidates tab).

About Company

ING

ING Global Career Opportunities - Welcome to 'careers at ING'. We give you the space to develop yourself as an intern, trainee and professional. Check out our opportunities. Jump on!
