Kubernetes Platform Engineer
Milpitas, CA - USA
Job Summary
Role Overview
We are seeking a hands-on Infrastructure Engineer responsible for building, operating, and scaling Kubernetes clusters on physical (bare-metal) servers. This role focuses on data center infrastructure, hardware provisioning, storage, automation, and cluster reliability. The ideal candidate has strong experience working in on-prem environments and is not primarily cloud-focused.
Key Responsibilities
- Deploy, configure, and maintain Kubernetes clusters on bare-metal infrastructure
- Perform rack & stack, hardware provisioning, cabling, and server lifecycle management
- Implement Infrastructure as Code (IaC) using Terraform and Ansible
- Design and manage persistent storage solutions (SAN, NAS, Ceph, or similar)
- Monitor cluster health, performance, and availability using observability tools
- Implement high availability, backup, and disaster recovery strategies
- Manage networking configurations, including VLANs, load balancing, and DNS
- Troubleshoot hardware, OS, networking, and cluster-level issues
- Collaborate with platform, DevOps, and application teams to ensure reliability and scalability
- Maintain documentation for infrastructure processes and runbooks
Required Skills & Expertise
Core Infrastructure
- Strong experience with bare-metal server provisioning & data center operations
- Hands-on experience with rack & stack and physical server management
- Linux system administration (RHEL, Ubuntu, or similar)
Container & Orchestration
- Deep expertise in Kubernetes cluster deployment and operations
- Experience managing multi-node clusters in production environments
Infrastructure Automation
- Terraform & Ansible for provisioning and configuration management
- Infrastructure as Code (IaC) best practices
Storage & Networking
- Experience with persistent storage solutions (Ceph, GlusterFS, SAN/NAS)
- Networking fundamentals: VLANs, routing, DNS, load balancing
Monitoring & Reliability
- Monitoring & logging tools such as Prometheus, Grafana, and the ELK stack
- Experience ensuring high availability and system reliability
Preferred Qualifications
- Experience with container runtimes and networking (CRI-O, containerd, CNI plugins)
- Knowledge of disaster recovery and backup strategies
- Exposure to security hardening and compliance practices
- Scripting skills (Bash, Python)