Job Description:
About Organization
The core objective of Service Operation is to ensure the effective and efficient delivery of Rakuten Mobile's Telco/IT services to all stakeholders. This encompasses maintaining highly available, secure, and performant OSS (Operations Support Systems) platforms across both cloud and network domains.
We achieve this through:
Reliability & Resilience: Defining and meeting SLIs/SLOs, proactive monitoring, blameless RCAs, and engineering HA/DR architectures with runbooks to meet agreed RTO/RPO.
Automation & Efficiency: Increasing automation via IaC, CI/CD, and policy-as-code to reduce toil, standardize environments, and optimize cloud spend through FinOps without compromising SLOs or security.
Collaboration & Improvement: Enabling product teams to meet non-functional requirements and ship reliably, and driving continuous improvement through documentation, mentorship, and uplift of engineering standards.
Our Service Operation Management team plays a crucial role in enhancing organizational efficiency and productivity through automated systems and processes. This role covers a broad technical scope, including cloud networking, Kubernetes, observability, security, and FinOps, with end-to-end ownership influencing architecture standards and operational strategy. We leverage modern tooling and practices (e.g., Terraform/Ansible, Kubernetes/OpenShift, Prometheus/Grafana, ELK/OpenSearch, Datadog/New Relic, GitHub/GitLab/Jenkins) and apply strong engineering rigor (DR drills, load/failover testing, chaos/experimentation) while collaborating cross-functionally with various teams.
Job Duties
OSS Platform & Architecture Subject Matter Expert (SME):
Performance tuning
High Availability (HA) / Disaster Recovery (DR) design and testing
Observability engineering
Cloud networking
Incident leadership and Root Cause Analysis (RCA)
Capacity and demand forecasting
FinOps strategy and implementation
Mentoring and setting engineering standards
Minimum Qualifications
Education: Bachelor's or Master's degree in Engineering or Technology.
Experience: 10 years of total experience in Site Reliability Engineering (SRE), DevOps, or Cloud Infrastructure Architecture.
Technical Skills:
Cloud Platforms: AWS/Azure/GCP
Networking: VPC/VNet, ALB/ELB/NGINX, WAF, DNS/CDN
IaC/Automation: Terraform, Ansible; Scripting: Python/Bash/Go
Containers/Orchestration: Docker, Kubernetes/OpenShift; Service Mesh (Istio/Linkerd) a plus
Observability: Prometheus, Grafana, ELK/OpenSearch, Datadog/New Relic, OpenTelemetry
CI/CD: GitHub Actions, GitLab CI, Jenkins, Azure DevOps
SCM and Collaboration: Git, Jira, Confluence
Key Achievements/Experience:
Demonstrated success in DR/BCP readiness, including meeting RTO/RPO targets in quarterly failover tests.
Proven ability to achieve infrastructure cost savings and unit cost efficiency improvements without impacting SLOs.
Languages:
English (Overall - 4 - Fluent)
Full-Time