Senior Observability Engineer

Cluj-Napoca - Romania

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

The Senior Observability Engineer designs, implements, and continuously improves observability for live services (bespoke apps), enabling reliable operations and faster change. The role establishes practical monitoring standards and telemetry (metrics, logs, traces, synthetics, events), integrates observability into delivery pipelines, and uses data to drive incident reduction, faster diagnosis, improved performance, and measurable service outcomes.

This role operates under broad direction and is accountable for delivering improvements to reliability and operational performance, collaborating with engineering, platform, security, and service delivery stakeholders. The work aligns with Applications Management practices for live support, problem management, service performance, and continuous improvement.

Responsibilities:

Define and implement observability strategy per service: SLIs/SLOs, telemetry standards, ownership model, alerting principles, and runbook requirements.
Build and maintain telemetry across services and platforms: instrumentation, dashboards, alerts, and automated detection of abnormal behavior.
Improve incident and problem outcomes: reduce MTTR via better signals, correlation, and actionable alerts; support RCA and trend analysis; drive prevention and backlog items.
Operational performance and capacity: establish performance baselines, detect regressions, support capacity planning, and measure availability/performance against targets.
Embed observability into delivery: integrate instrumentation and quality gates into CI/CD; define "ready for ops" acceptance criteria related to monitoring and supportability.
Enable teams: coach engineers and support teams on using observability tooling, operational diagnostics, and effective on-call practices.
Stakeholder management: communicate clearly with technical and non-technical stakeholders on service health, risks, trade-offs, and improvement plans.

Ways of working and behaviors expected:

Analytical and structured approach to problem solving; comfortable working with incomplete data and iterating toward clarity.
Strong written and verbal communication; able to present service health and risks to both technical and non-technical audiences.
Proactive ownership: identifies gaps, drives improvement actions, and follows through to measurable outcomes.
Collaborative mindset: works effectively across engineering, platform, and service management functions.

Qualifications:

At least 3 years of relevant hands-on experience in the following areas:

Observability engineering:

Ability to design an observability approach that covers metrics, logs, traces, and synthetics with clear intent (detection, diagnosis, prediction, and validation).
Strong capability in alert design (signal vs noise), routing, escalation, and maintaining an actionable on-call experience (runbooks, playbooks, ownership).
Practical experience defining SLIs/SLOs, error budgets, and service health reporting that supports operational decision-making.

Operations and service management alignment:

Strong experience working in (or closely with) live production support, including incident management, major incidents, and problem management.
Ability to translate operational pain into measurable improvement plans and to track improvements using service metrics and trends.
Working knowledge of change/release impact and how observability supports safe deployment and early-life support.

Technical foundations:

Cloud and container platforms: hands-on with at least one major cloud provider and modern runtime patterns (Kubernetes, containers, managed services).
Distributed systems fundamentals: latency, saturation, errors, throughput, dependencies, and failure modes.
Automation / scripting: ability to automate repetitive diagnostics and observability configuration (e.g. Python, Bash, PowerShell, or similar).
Security awareness: safe handling of telemetry, access controls, data retention, and secure-by-default configuration.

Tooling (examples, not an exhaustive list):

Experience with one or more observability stacks such as:
- Metrics: Prometheus, Grafana, Datadog, CloudWatch/Azure Monitor, etc.
- Logs: ELK/OpenSearch, Splunk, Loki, etc.
- Tracing/APM: OpenTelemetry, Jaeger, Tempo, New Relic, Dynatrace, Datadog APM, etc.
Experience integrating observability with ITSM / collaboration tooling (e.g. ServiceNow/Jira, ChatOps, incident paging).

Nice-to-have:

Experience with observability-as-code (dashboards/alerts via Terraform, GitOps, Helm, API-driven configuration).
Experience with synthetics and RUM (web performance/user journey monitoring).
Experience building golden signal dashboards, dependency maps, or automated correlation.
Experience supporting regulated environments (e.g. ISO controls, auditability, change governance).

Additional Information:

Discover some of the global benefits that empower our people to become the best version of themselves:

Finance: Competitive salary package, share plan, company performance bonuses, value-based recognition awards, referral bonus;
Career Development: Career coaching, global career opportunities, non-linear career paths, internal development programmes for management and technical leadership;
Learning Opportunities: Complex projects, rotations, internal tech communities, training, certifications, coaching, online learning platforms subscriptions, pass-it-on sessions, workshops, conferences;
Work-Life Balance: Hybrid work and flexible working hours, employee assistance programme;
Health: Global internal wellbeing programme, access to wellbeing apps;
Community: Global internal tech communities, hobby clubs and interest groups, inclusion and diversity programmes, events and celebrations.

At Endava, we're committed to creating an open, inclusive, and respectful environment where everyone feels safe, valued, and empowered to be their best. We welcome applications from people of all backgrounds, experiences, and perspectives because we know that inclusive teams help us deliver smarter, more innovative solutions for our customers. Hiring decisions are based on merit, skills, qualifications, and potential. If you need adjustments or support during the recruitment process, please let us know.

Remote Work :

Employment Type :

Full-time

About Company

Endava

For over two decades, we have been harnessing technology to drive meaningful change. By combining world-class engineering, industry expertise and a people-centric mindset, we consult and partner with our customers to create technological solutions that drive innovation and transfor ...
