We are seeking a skilled Site Reliability Engineer (SRE)/Data Infrastructure Engineer. The ideal candidate will have expertise in Kubernetes and big data technologies, and proficiency in Python, Go, or Java. This role focuses on ensuring the reliability, scalability, and performance of our distributed systems and big data infrastructure.
Key Responsibilities:
- Design, implement, and maintain Kubernetes-based infrastructure for deploying and scaling applications
- Develop and optimize big data pipelines using Apache Spark and other related technologies
- Write efficient, production-grade code in Python, Go, or Java to automate processes and improve system reliability
- Implement observability solutions, including logs, metrics, traces, and profiles
- Collaborate with development teams to design and troubleshoot solutions for complex infrastructure issues
- Ensure high availability, fault tolerance, and disaster recovery for critical systems
- Optimize resource utilization and performance of big data processing workflows
Required Qualifications:
- Strong experience with Kubernetes, including cluster management and application deployment
- Proficiency in at least one of the following programming languages: Python, Go, or Java
- Hands-on experience with Apache Spark and big data processing frameworks
- Solid understanding of Linux systems and networking
- Experience with cloud platforms such as AWS, GCP, or Azure
- Familiarity with CI/CD pipelines and version control systems (e.g., Git)
- Understanding of distributed systems and microservices architecture
In this role, you will be responsible for maintaining the reliability and performance of our big data infrastructure while continuously improving our Kubernetes-based deployment processes.