As a Site Reliability Engineer (SRE), you will be responsible for ensuring the reliability, scalability, and performance of our production systems. You will work closely with Development, QA, and Infrastructure teams to maintain high availability, optimize system performance, and implement SRE best practices. Your role will focus on operational excellence, incident management, and building resilient systems while collaborating with engineering teams to improve application reliability.
Responsibilities:
- Monitor, maintain, and improve system reliability, availability, and performance.
- Participate in on-call rotations, respond to incidents, conduct root cause analysis (RCA), and implement preventive measures.
- Define and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
- Reduce manual toil through automation and self-healing systems.
- Analyze system performance, identify bottlenecks, and optimize infrastructure.
- Conduct capacity planning and develop scaling strategies to handle growth.
- Work with Development teams to ensure deployment strategies (blue-green, canary) minimize downtime.
- Enhance monitoring, logging, and alerting (e.g., Prometheus, Grafana, ELK, Datadog).
- Ensure proper observability for proactive issue detection.
- Implement distributed tracing for microservices troubleshooting.
- Manage cloud/infrastructure components (AWS/Azure, Kubernetes, Terraform).
- Automate operational tasks using scripting (Bash/Python) and Infrastructure as Code (IaC).
- Collaborate with Infrastructure teams to improve deployment reliability.
- Partner with Development teams to improve application resilience (retries, circuit breakers, graceful degradation).
- Work with QA teams to ensure reliability testing is part of the development lifecycle.
- Document runbooks, operational procedures, and postmortems.
Qualifications:
- Years of Experience: 3-5 Years
- Education: BS/MS in Computer Science preferred, but can be waived for exceptional candidates.
Requirements:
- 3 years of experience in SRE, Production Engineering, or Cloud Operations.
- Strong experience with Linux, Kubernetes, Docker, and cloud platforms (AWS/GCP/Azure).
- Proficiency in monitoring tools (Prometheus, Grafana, Datadog).
- Coding/scripting skills (Python, Bash) for automation.
- Experience with IaC (Terraform, CloudFormation).
- Knowledge of networking, security, and database performance tuning.
Nice to Have:
- Knowledge of GitOps (ArgoCD, Flux).
- Certifications such as AWS, CKA (Kubernetes), or Google SRE.
Additional Information:
- Employment Type: Full-time
- Weekend: 2 Days
- Work Model: Hybrid
Compensation and Benefits:
Join a Workplace That Values You
At Nifty Coders Pvt. Ltd., we celebrate innovation, collaboration, and the unique contributions each of our employees brings. We prioritize a work environment that encourages growth, well-being, and a healthy work-life balance. Here, you'll be part of a team that values creativity, promotes flexibility, and empowers individuals to thrive.
As part of our commitment to supporting you, we offer a range of benefits and perks designed to enhance your work experience:
- Competitive compensation plans
- Two annual bonuses
- Paid Maternity Leave (4 months) and Paternity Leave (5 working days)
- Comprehensive medical insurance for you and your dependents
- Monthly and quarterly team-building events
- Transport allowance
- Mobile allowance
- Corporate home internet support
- Subsidized daily lunch
- A dynamic performance review process that fosters ongoing transparency between managers and team members
- Company-sponsored certification programs for internal career growth and development
At Nifty Coders, we foster a culture of collaboration, continuous learning, and innovation, ensuring that every employee has the opportunity to grow and succeed.
Application Deadline: February 10, 2026
Remote Work: No