DevOps / Systems Engineer
Role Overview
We are looking for a DevOps/Systems Engineer with hands-on production experience who
can design, build, and maintain reliable infrastructure and deployment pipelines. This role
requires someone who has moved beyond fundamentals and can independently own
systems, troubleshoot complex issues, and drive continuous improvement across our
platform.
Technical Skills
Must Have (Non-Negotiable)
Linux Administration - Strong command of system internals, process
management, cgroups, performance tuning, kernel parameters, and advanced
troubleshooting
Docker (Production-Grade) - Multi-stage builds, image optimization, private
registry management, security best practices, and an understanding of container
limitations
Kubernetes (Hands-On) - Managing clusters in production, deployments, services,
ConfigMaps, Secrets, resource limits, HPA, network policies, and debugging pod
failures
CI/CD Pipeline Design - Building and maintaining robust pipelines with proper
stages, caching, parallel execution, artifact management, and security scanning
Infrastructure as Code - Proficiency in Terraform (modules, state management,
workspaces) or equivalent; Ansible for configuration management with idempotent
playbooks
Python - Writing maintainable automation tools, CLIs, API integrations, and scripts
that follow coding standards
YAML & Templating - Helm charts, Kustomize, and managing configuration
complexity without duplication
Cloud Platforms - Strong working knowledge of AWS/GCP/Azure, including
networking (VPC, subnets, security groups), IAM, compute, and storage services
RPM/Package Management - Building packages, managing repositories, handling
dependencies, and versioning strategies
Networking - Load balancing, DNS management, SSL/TLS, proxy configurations,
and troubleshooting connectivity issues
Good to Have
GitOps Practices - Experience with ArgoCD, Flux, or similar continuous delivery
tools
Service Mesh - Understanding of Istio, Linkerd, or similar technologies and their
use cases
Observability Stack - Experience with logging (ELK/Loki), metrics
(Prometheus/Grafana), and tracing (Jaeger/Zipkin)
Security Practices - Secret management (Vault), supply chain security,
vulnerability scanning, and least-privilege implementation
Chaos Engineering - Familiarity with fault injection and resilience testing
Database Operations - Basic DBA tasks, backup strategies, and performance
monitoring
Cost Optimization - Experience with cloud cost management and resource
right-sizing
Deployment & Operational Knowledge
Must Have (Non-Negotiable)
Deployment Strategies - Hands-on experience with blue/green, canary, and rolling
deployments
Rollback Procedures - Ability to design and execute rollback plans; every
deployment must have an exit strategy
Health Checks & Validation - Implementing readiness/liveness probes, smoke
tests, and deployment validation
Incident Response - Experience handling production incidents, conducting root
cause analysis, and writing post-mortems
Awareness of Anti-Patterns (Must Actively Avoid)
Deploying without rollback capability
Manual or undocumented deployment procedures
Configuration drift between environments
Skipping pre-production validation
Direct changes in production without change management
Ignoring monitoring alerts or deferring investigation
"Quick fixes" without addressing root causes
Testing & Validation
Must Have
Infrastructure Testing - Writing tests for IaC (Terratest, kitchen-terraform, or
equivalent)
Pipeline Validation - Smoke tests and integration tests as part of deployment
Environment Parity - Ensuring consistency across dev, staging, and production
Good to Have
Load Testing - Experience with performance testing tools and capacity planning
Security Testing - SAST/DAST integration in pipelines
Mindset & Behavioral Expectations
Must Have (Non-Negotiable)
Deep Debugging Ability - Does not stop at surface-level fixes; traces issues to root
cause through logs, metrics, and system analysis; genuinely enjoys the process of
understanding why something failed
Relentless Persistence - Stays with difficult problems until resolved; escalates
appropriately but does not abandon issues; comfortable with ambiguity and
incomplete information
Continuous Improvement Drive - Proactively identifies inefficiencies in systems
and processes; proposes and implements improvements without being asked;
dissatisfied with "good enough" when better is achievable
Automation-First Thinking - Recognizes repetitive tasks and builds automation;
avoids manual processes wherever possible
Ownership & Accountability - Takes end-to-end responsibility for systems; does
not pass problems to others without context; follows through on commitments
Clear Communication - Documents decisions and processes; explains complex
technical issues to varied audiences; writes useful runbooks and post-mortems
Good to Have
Mentorship Ability - Willingness to guide junior engineers and share knowledge
Cross-Team Collaboration - Experience working with development, QA, and
security teams to improve overall delivery
Blameless Culture Advocate - Focuses on system improvements rather than
individual blame during incidents
Experience
2-5 years of hands-on DevOps, SRE, or Platform Engineering experience
Demonstrated experience managing production systems
Track record of building and improving CI/CD pipelines and infrastructure
automation
Responsibilities
Design and maintain CI/CD pipelines and deployment infrastructure
Manage Kubernetes clusters and container orchestration
Implement and maintain Infrastructure as Code
Participate in on-call rotations and incident response
Drive platform reliability and developer experience improvements
Conduct capacity planning and cost optimization
Mentor associate engineers and contribute to team knowledge sharing
Required Skills:
DevOps