Requirements:
We are seeking a proactive and technically strong Site Reliability Engineer (SRE) to ensure the stability, performance, and scalability of our Data Engineering Platform. You will work on cutting-edge technologies including Cloudera Hadoop, Spark, Airflow, NiFi, and Kubernetes, ensuring high availability and driving automation to support massive-scale data workloads, especially in the telecom domain.
Key Responsibilities
Ensure platform uptime and application health as per SLOs/KPIs
Monitor infrastructure and applications using ELK, Prometheus, Zabbix, etc.
Debug and resolve complex production issues, performing root cause analysis
Automate routine tasks and implement self-healing systems
Design and maintain dashboards, alerts, and operational playbooks
Participate in incident management, problem resolution, and RCA documentation
Own and update SOPs for repeatable processes
Collaborate with L3 and Product teams for deeper issue resolution
Support and guide L1 operations team
Conduct periodic system maintenance and performance tuning
Respond to user data requests and ensure timely resolution
Address and mitigate security vulnerabilities and compliance issues
Technical Skillset
Hands-on with Spark, Hive, Cloudera Hadoop, Kafka, and Ranger
Strong Linux fundamentals and scripting (Python, Shell)
Experience with Apache NiFi, Airflow, YARN, and ZooKeeper
Proficient in monitoring and observability tools: ELK Stack, Prometheus, Loki
Working knowledge of Kubernetes, Docker, and Jenkins CI/CD pipelines
Strong SQL skills (Oracle/Exadata preferred)
Job Description:
Familiarity with DataHub, Data Mesh, and security best practices is a plus
Strong problem-solving and debugging mindset
Ability to work under pressure in a fast-paced environment
Excellent communication and collaboration skills
Ownership, customer orientation, and a bias for action
Full Time