Job Description:
We are seeking a highly experienced Site Reliability Engineer (SRE) with a strong Java development background to lead reliability initiatives and ensure the stability, scalability, and performance of mission-critical systems. This role blends deep hands-on engineering with leadership, ownership, and a proactive approach to reliability and operations.
The ideal candidate has evolved from a strong developer into an SRE/DevOps leader, understands production systems deeply, and can partner effectively with development, platform, and operations teams.
Key Responsibilities:
- Design, build, and maintain highly reliable, scalable, and fault-tolerant systems in production environments.
- Embed reliability best practices (SLOs, SLIs, error budgets) into the software development lifecycle.
- Work closely with development teams on Java Spring Boot microservices to improve operability and resilience.
- Automate operational workflows to reduce manual effort and improve system efficiency.
- Monitor system health, performance, and availability; proactively identify risks and bottlenecks.
- Lead incident management, on-call support, and root cause analysis for production issues.
- Drive continuous improvement initiatives focused on availability, scalability, and performance.
- Support and oversee release and deployment activities, including after-hours support when required.
- Champion best practices around CI/CD, infrastructure as code, and cloud-native operations.
- Mentor engineers and provide technical leadership across SRE and development teams.
- Collaborate with stakeholders to align reliability goals with business priorities.
Required Qualifications:
- 12 years of IT experience in SRE, DevOps, or Production Engineering
- Strong Java development experience (Java 17, Spring Boot, Microservices, Spring Web)
- Hands-on experience with OpenShift (OCP), Kubernetes, and Docker
- Strong expertise in MongoDB (data modeling, design, optimization)
- Experience with Apache Kafka and event-driven architectures
- Working knowledge of Oracle Database
- Familiarity with BDD practices
- Solid experience with CI/CD automation and IaC (Terraform, Ansible)
- Exposure to AI-assisted development tools (e.g., GitHub Copilot)
- Excellent troubleshooting skills in high-pressure production environments
- Strong communication, collaboration, and ownership mindset
Preferred Qualifications:
- Experience with monitoring and observability tools such as Prometheus, Grafana, and the ELK stack.
- Knowledge of security best practices, compliance standards, and production hardening.
- Prior experience leading or mentoring SRE teams or guiding engineers in reliability practices.