Key Responsibilities
Cluster Operations & Management
Manage and maintain container clusters (Kubernetes, Docker) and open-source component clusters (Kafka, Redis, Elasticsearch) across multiple business units
Ensure optimal performance, scalability, and reliability of distributed systems
Infrastructure Platform Development
Design, build, and enhance infrastructure operation platforms
Develop and maintain systems for infrastructure management, CI/CD pipelines, monitoring/alerting, and centralized logging
Drive platform standardization and automation initiatives
High Availability & Reliability
Ensure maximum uptime for production services through proactive monitoring and incident response
Continuously optimize service architecture, deployment strategies, and operational processes
Implement and maintain SLA/SLO frameworks and reliability engineering practices
Automation & Process Improvement
Lead the development of automated operations and maintenance systems
Create self-service tools and workflows to improve team productivity
Establish best practices for infrastructure as code and configuration management
Required Qualifications
Experience & Education
2 years of hands-on experience in Systems Operations, DevOps, or Site Reliability Engineering (SRE)
Bachelor's degree in Computer Science, Engineering, or a related technical field preferred
Cloud & Infrastructure
Experience with public cloud platforms (AWS, Azure, or GCP) is highly valued
Strong understanding of large-scale internet architecture and distributed systems
Proven experience with infrastructure monitoring, logging, and observability tools
Technical Skills
Proficiency in scripting and automation using Shell, Python, or similar languages
Strong knowledge of containerization technologies (Kubernetes, Docker)
Hands-on experience operating production-grade container clusters and managing CI/CD pipelines
Strong familiarity with common infrastructure components: Nginx, MySQL, Redis, Kafka, Elasticsearch
Advanced Networking (Preferred)
Experience with Service Mesh architectures, Cilium CNI, and eBPF technologies
Understanding of network security, load balancing, and traffic management
Knowledge of cloud-native networking patterns and best practices
About Manus AI
Manus is a general AI agent that bridges minds and actions: it doesn't just think, it delivers results. Manus excels at various tasks in work and life, getting everything done while you rest. At Manus AI, we offer a highly collaborative and innovative environment where experts across engineering, research, and business come together to push the boundaries of AI applications. If you're passionate about cutting-edge technology and making a real impact, we'd love to hear from you!
Required Experience:
IC