In-Office Position
POSITION SUMMARY:
The Senior Systems Engineer is a hands-on senior individual contributor responsible for designing, building, and operating TRISTAR's core infrastructure platform, with a strong emphasis on Linux systems, Kubernetes, and automation. This role will own the Kubernetes platform end to end (cluster build, lifecycle management, operational standards, reliability, and day-2 operations) while partnering closely with development teams as TRISTAR transitions toward a DevOps operating model. Success in this role requires deep technical ownership, strong troubleshooting skills across distributed systems, and the ability to improve reliability through thoughtful design, observability, and repeatable automation.
ESSENTIAL DUTIES AND RESPONSIBILITIES:
Kubernetes Platform Engineering & Lifecycle:
Design, build, and operate Kubernetes clusters in production, including upgrades, patching, scaling, and reliability improvements.
Establish platform standards and operating practices as the environment matures (cluster configuration, access patterns, resource governance, and runbooks).
Serve as the senior escalation point for Kubernetes platform issues and drive resolution through root-cause analysis and prevention.
Kubernetes Storage, Backup/Restore & Disaster Recovery:
Design and implement Kubernetes storage patterns (StorageClasses, PV/PVC lifecycle, capacity planning) and support stateful workloads.
Implement, test, and maintain Kubernetes-native backup/restore and recovery procedures.
Integrate Kubernetes persistence needs with enterprise storage platforms, including Dell ObjectScale and existing virtualization/storage systems.
Ingress, Load Balancing & Kubernetes Networking:
Own Kubernetes traffic entry, including ingress controllers, load balancers, routing patterns, and TLS/certificate handling.
Define repeatable patterns for exposing services and troubleshooting connectivity across platform components.
Linux Systems Engineering:
Administer and harden the Linux systems that support the platform, including patching, performance tuning, service reliability, logging, and baseline configuration.
Troubleshoot system and platform issues across compute, storage, and network dependencies.
Automation, Scripting & API Integrations:
Build automation to reduce manual work and increase consistency across infrastructure operations using Python/PowerShell/Bash and API-driven workflows.
Evaluate, recommend, and help implement an automation/configuration management approach (tooling, patterns, and standards) to support repeatable tasks such as provisioning, configuration enforcement, patching, drift detection, and validation.
Develop reusable automation assets (modules/playbooks/templates/scripts) and establish version-controlled workflows (Git), documentation, and operational handoff practices.
Leverage RESTful APIs to integrate systems and create operational workflows (health checks, reporting, event-driven automations, and change validation).
Monitoring, Alert Response & Operational Reporting:
Monitor alert sources and observability tooling (including on-premises SolarWinds), investigate events, and drive issues to completion.
Document incidents, actions taken, and final resolutions; contribute to improved alerting quality and operational visibility.
Data Center Support (Occasional):
Provide occasional on-site support in the data center for infrastructure prep and troubleshooting (racking equipment, cabling, and physical connectivity verification).
Maintain working familiarity with server hardware and data center best practices to support rare hands-on needs.
Cloud Readiness & Future-State Hosting:
Partner with development and infrastructure teams to plan and advance TRISTAR's long-term transition toward cloud-hosted deployments of the application stack.
Contribute to cloud design discussions with a practical understanding of core cloud concepts (networking, identity/access, security, reliability, scalability, and cost considerations) across major providers (AWS/Azure/GCP).
Translate application and platform requirements into cloud-ready operational patterns (container orchestration in the cloud, managed-service vs. self-managed tradeoffs, per-client environment isolation, and deployment repeatability).
Support early-stage cloud initiatives such as proofs of concept, reference architectures, and migration planning, including identifying skill/tooling gaps and recommending realistic next steps.
Apply Infrastructure-as-Code and automation principles to cloud readiness efforts to ensure future deployments are repeatable, supportable, and auditable.
Documentation & Technical Standards:
Create and maintain IT documentation, including platform runbooks, operational procedures, and architecture/standards documentation.
Collaboration, Service Desk Support & Cross-Team Execution:
Work with the Manager of Network Services and general IT staff to analyze and resolve technical issues affecting infrastructure and applications.
Partner closely with development teams as part of TRISTAR's DevOps transition to improve operability, deployment reliability, and platform usability.
Work alongside the service desk to remedy end-user workstation issues; backfill and answer service desk calls when required.
Schedule Flexibility & Travel:
Perform night/day/weekend work as required to meet project objectives and support maintenance windows.
Travel to remote sites is rare but may be required as needed.
QUALIFICATIONS REQUIRED:
Education/Experience: Bachelor's degree in a related field (preferred); minimum of 7 years of related experience; or an equivalent combination of education and experience.
Knowledge, Skills, and Abilities:
7 years of progressively responsible experience in systems/infrastructure engineering, with strong production experience in Linux administration.
Hands-on production experience with Kubernetes, including cluster build and lifecycle management (architecture, upgrades, patching, scaling, and troubleshooting).
Strong understanding of Kubernetes storage and stateful workload operations, including troubleshooting PV/PVC and storage provisioning patterns.
Experience implementing Kubernetes-native backup/restore practices and validating recovery procedures.
Demonstrated automation experience using scripting (Python/PowerShell/Bash) and leveraging RESTful APIs for systems integration and automation.
Experience with monitoring/observability platforms and operational alerting; SolarWinds experience strongly preferred.
Strong troubleshooting skills across distributed systems, networking fundamentals, and infrastructure dependencies.
Strong written and verbal communication skills, including documentation, runbooks, and standards.
EQUIPMENT OPERATED/USED: Computer, 10-key, printer, copier, fax machine, and other office equipment.
SPECIAL EQUIPMENT OR CLOTHING: Appropriate office attire.
Required Experience:
Senior IC