Role Summary
Responsible for day-to-day support of a high-availability AI/ML platform, ensuring stability across Linux systems, Kubernetes environments, and enterprise infrastructure. The role focuses on operational support, incident management, and basic troubleshooting in a regulated production environment.
Key Responsibilities
Perform routine platform operations (access requests, monitoring, health checks)
Handle ticket triage, follow runbooks, and escalate when needed
Manage Linux systems (logs, services, basic administration)
Support Kubernetes workloads (pods, services, basic troubleshooting)
Troubleshoot network connectivity issues (TCP/IP, DNS)
Maintain documentation, runbooks, and incident records
Support AI/ML platform operations in collaboration with data teams
Required Skills
Experience in IT infrastructure / platform support (NOC/SRE/Support)
Strong Linux fundamentals
Basic knowledge of networking (TCP/IP, DNS)
Understanding of Kubernetes & containerization
Scripting skills in Python and/or JavaScript
Preferred Skills
Exposure to Docker and Kubernetes (hands-on)
Familiarity with Ansible/Puppet (automation tools)
Awareness of AI/ML platforms or GPU environments
Key Attributes
Strong troubleshooting and analytical skills
Ability to work in 24x7/shift environments
Good communication and documentation skills
Experience
Required Experience:
IC