We are looking for an SRE / OPS / DevOps Engineer to join our infrastructure practice. This is a cross-functional role that combines Site Reliability Engineering, Operations, and DevOps disciplines. You will be embedded in a team that owns and operates complex multi-datacenter infrastructure supporting multiple high-traffic gaming studios at Playtika.
You will work alongside experienced engineers across SRE, DBA, and Platform teams to keep our services reliable, scalable, and observable.
Responsibilities
- Operate and maintain Kubernetes-based infrastructure (SpectroCloud / CloudStack) across multiple datacenters in the US
- Support and troubleshoot a wide range of stateful workloads running in k8s: Kafka, Redis / KeyDB / RedisLabs, MariaDB / Galera, Aerospike, SingleStore, Elasticsearch / OpenSearch
- Participate in cloud migration projects: prepare environments, perform data sync and failover, execute final cutover
- Manage and maintain monitoring, alerting, and observability stacks: Prometheus, VictoriaMetrics, Grafana, AlertManager, PagerDuty
- Maintain and improve logging infrastructure: ELK / OpenSearch stack, Filebeat, Logstash, including configuration, performance tuning, and index lifecycle management
- Configure and maintain load balancers (Nginx, MaxScale, internal PLB / IPVS-based solutions) and manage DNS records
- Manage SSL certificate lifecycle automation (Sectigo, Prometheus / Grafana)
- Administer and maintain secrets management systems (HashiCorp Vault, External Secrets Operator)
- Participate in on-call duty rotation and respond to production incidents; follow and improve SRE alert handling guidelines
- Contribute to GitOps workflows: maintain infrastructure-as-code in Git, work with Flux CD deployment packages and Ansible playbooks
- Review and extend automation scripts and tooling (primarily Bash and Python)
- Provide SRE-level support to multiple game studios: investigate production issues, handle ChatOps requests, and collaborate with development teams
- Write and maintain operational runbooks, SOPs, migration plans, and other technical documentation in Confluence
- Perform capacity planning reviews and resource utilization analysis for datastores and cluster nodes
- Participate in cross-team initiatives and contribute to platform-level improvements
Requirements
- Solid hands-on experience with Linux systems (primarily Ubuntu 22.04 / 24.04 LTS)
- Practical knowledge of Kubernetes administration: workloads, operators, resource management, node maintenance, cluster upgrades
- Experience operating and troubleshooting stateful services in k8s: at least one of Kafka, Redis / KeyDB, MariaDB / Galera, Elasticsearch / OpenSearch, Aerospike
- Familiarity with GitOps approach and tools: Git, Flux CD, Helm, Kustomize
- Monitoring and observability experience: Prometheus ecosystem, VictoriaMetrics, Grafana, AlertManager
- Practical experience with ELK / OpenSearch stack (Filebeat, Logstash, index management)
- Solid scripting skills (Bash); ability to read and modify Python
- Understanding of networking fundamentals: TCP/IP, DNS, load balancing, ports and protocols, VIPs, VLANs
- Experience with HashiCorp Vault or similar secrets management solutions
- Familiarity with Ansible for infrastructure automation
- Ability to troubleshoot complex distributed system issues under production pressure
- Strong communication skills: ability to collaborate across SRE, DBA, R&D, and NOC teams
- Experience working with Jira-based workflows and documenting in Confluence
Nice to Have
- Experience with cloud migration projects (datacenter-to-cloud or cloud-to-datacenter)
- Knowledge of additional datastores: SingleStore (formerly MemSQL), Aerospike, Couchbase, KeyDB
- Familiarity with HashiCorp Boundary for secure remote access and Ansible dynamic inventory
- Experience with load balancer solutions: Nginx, F5, IPVS
- Understanding of high-availability and DR patterns: active-active, active-passive, failover procedures
- Exposure to SSL certificate lifecycle automation (cert-manager, Sectigo)
- Knowledge of PagerDuty or similar on-call and incident management platforms
- Experience with AWS (IAM, S3, EC2) in the context of infrastructure operations
- Understanding of SLO/SLA concepts and how they apply to infrastructure reliability
- Familiarity with Python-based monitoring collectors or custom exporters for Prometheus
- Experience with capacity planning, performance analysis, and resource optimization in production environments
Our Stack at a Glance
Container Orchestration | Kubernetes (SpectroCloud, CloudStack), Flux CD, Helm, Kustomize |
Databases & Datastores | MariaDB / Galera, MaxScale, SingleStore, Aerospike, Redis / KeyDB, RedisLabs (Redis Enterprise), Couchbase |
Message Brokers | Apache Kafka, Kafka MirrorMaker |
Search & Logging | Elasticsearch / OpenSearch (ECK), Kibana / OpenSearch Dashboards, Filebeat, Logstash |
Monitoring | Prometheus, VictoriaMetrics, Grafana, AlertManager, PagerDuty |
Load Balancing | Nginx, IPVS, MaxScale |
Secrets & Access | HashiCorp Vault, External Secrets Operator, HashiCorp Boundary |
Automation & CI | Ansible, Jenkins, Bash, Python |
Cloud & Infra | CloudStack, AWS, DNS, Sectigo (SSL automation) |
Collaboration | Jira, Confluence, Git (GitHub), Teams |
Required Experience:
IC