Senior Reliability Engineer (Senior level)
Description
- Site Reliability Engineering (SRE) is a modern way of delivering IT solutions by embedding software engineering principles in service delivery to reduce IT risk to the business, improve business resilience, attain predictability and reliability, and optimize the cost of IT infrastructure and operations.
- A Reliability Engineer typically has deep software engineering experience encompassing the design, build, deployment, and management / maintenance of an IT solution, ensuring resilience, reliability, and performance.
- A Reliability Engineer is a bridge between development and operations, applying a software engineering mindset to the development, deployment, and maintenance of applications to maximize system reliability and automation while improving efficiency by optimizing resources.
Responsibilities
- Defining SLA/SLO/SLI for a product / service
- Engineering resilient design and implementation practices into solutions as they go through the product life cycle
- Engineering out manual effort (Toil) through the development of automated processes and services (e.g., automated management of systems, CI/CD improvements)
- Developing Observability Solutions to track, report, and measure SLA adherence
- Helping optimize the cost of IT infrastructure and operations (FinOps)
- Critical Situation management
- SOP / Runbook automation and Toil reduction
- Data Analytics & System trend analysis
Typical Skills and Background
- 7 years of experience in software product engineering principles, processes, and systems
- Hands-on experience in Java / J2EE, one web server (Apache Tomcat or IBM HTTP Server), one application server (Tomcat/WebSphere), and any major RDBMS such as Oracle
- Hands-on experience in at least one CI/CD tool (Azure DevOps, GitLab CI/CD, Jenkins) and one IaC tool (Terraform, AWS CloudFormation, Ansible, etc.)
- Experience in at least one cloud technology (AWS/Azure/GCP, etc., and Docker, Pivotal, Kubernetes, OpenShift, etc.) and its reliability tools (Azure AppInsight, CloudWatch, Azure Monitor, etc.)
- Experience in the Linux (RHEL) operating system: performance monitoring parameters, their interpretation, and the commands used for monitoring
- Experience in Observability: APM tools (Dynatrace, AppDynamics, etc.), metrics / log consolidation (Splunk), and the ELK Stack
- Defining NFRs and SLA/SLO/SLI agreements for a product / platform / service
- Knowledge of the queuing models used, thread pools, request servicing processes, etc.
- Knowledge of Web Services, SOA, ESB (DataPower), and RESTful services
- Knowledge of application design patterns, J2EE application architectures, Microservices, Spring Boot, and Cloud native architectures
- Proficiency in Java runtimes, Core Java, Garbage collection, and JVM parameter tuning
- Experience in performance tuning on Application Servers (Tomcat/WAS)
- Experience in troubleshooting Performance / Scalability / Availability issues
- Experience in Thread dump and heap dump generation and analysis
- Knowledge of Query tuning and database designs and models
- Knowledge of at least one automation scripting language such as Python
- Mastery in collaborative software development using Git, Jira, Confluence, etc.
- AI/ML and Data Analytics knowledge and experience are desirable
Required Experience:
Senior IC