Operation Engineer
Role summary - We're looking for a passionate Site Reliability Engineer (SRE) to join our infrastructure team. You must be a systems thinker, problem-solver, and automation advocate with a proven track record of building resilient, scalable systems. If you thrive on bridging development and operations, obsess over monitoring and metrics, and geek out on turning manual processes into self-healing systems, we want you.
Location: Guangzhou
On-site
Full-time
What you'll do
- Build, maintain, and optimize self-built Kubernetes platforms.
- Deploy and maintain Alibaba Cloud ACK (Alibaba Cloud Container Service for Kubernetes).
- Build, maintain, and optimize Alibaba Cloud integrated delivery platforms.
- Deploy and maintain distributed file systems such as GlusterFS.
- Deploy, maintain, and use Jenkins for CI/CD (Continuous Integration/Continuous Deployment).
- Support daily system releases and production operations.
- Design, build, and maintain highly available and scalable distributed systems.
- Track system health in real time through monitoring, alerting, and automation tools (e.g., Prometheus, Grafana).
- Develop automation tools and scripts (e.g., Python/Shell) to reduce manual operations and improve O&M (Operations and Maintenance) efficiency.
- Analyze system capacity requirements, formulate capacity expansion strategies, and balance cost against performance.
- Identify performance bottlenecks and optimize service response time, throughput, and resource utilization.
- Write and maintain technical and O&M documentation.
- Learn and introduce other useful tools.
- Undertake other tasks assigned by leadership.
Who you are
- More than 5 years of experience in O&M, SRE (Site Reliability Engineering), or DevOps; practical experience with large-scale distributed systems is preferred.
- Familiar with deploying and maintaining components on cloud platforms such as Alibaba Cloud, AWS, and Azure.
- Familiar with deploying, maintaining, and optimizing cloud Kubernetes services such as ACK (Alibaba Cloud Container Service for Kubernetes) and AKS (Azure Kubernetes Service).
- Familiar with SRE methodologies (e.g., SLI, SLO, error budget), with the ability to troubleshoot faults and optimize systems.
- Proficient in both cloud-based and self-built Kubernetes, with end-to-end project implementation experience.
- Skilled in the deployment and maintenance of distributed file system components such as GlusterFS.
- Familiar with monitoring and logging systems (e.g., Zabbix, Prometheus, Grafana, ELK, Datadog).
- Excellent logical thinking and problem-solving abilities.
- Strong sense of responsibility and self-motivation in O&M work.
- Strong self-learning, comprehension, and hands-on abilities.
- Good communication and teamwork skills.
Required Experience:
IC