Description
Join us as we pursue our groundbreaking vision to make machine data accessible, usable, and valuable to everyone. We are a company filled with people who are passionate about our product and seek to deliver the best experience for our customers. At Splunk, we are committed to our work, customers, having fun, and, most significantly, to each other's success.
Role
You will help us run one of the largest and most sophisticated cloud-scale, big-data, and microservices platforms in the world. You will be responsible for monitoring and resolving issues that affect the availability and performance of critical components of Splunk Observability Cloud. You will use your Kubernetes, cloud, and infrastructure-as-code knowledge to enhance Splunk Observability Cloud infrastructure while reducing its operational costs.
As such, you will provide on-call support and incident management for our customers. To ensure coverage, you will work a 40-hour Mon-Fri week and be available for production support on a rotating basis on a Saturday and/or Sunday. The flexible rotating roster is intended to balance employee wellbeing and business requirements to ensure customer expectations are met.
Responsibilities:
- Respond to monitoring alerts according to defined playbooks and procedures.
- Enhance playbooks and procedures to reduce on-call toil.
- Participate in Post-Incident Reviews and discussions.
- Ensure stability and performance of production environments.
- Deploy software to production environments.
- Build effective working relationships with cross-functional team members.
- Make suggestions for process improvements and enhance operational efficiencies.
- Implement various process improvements and operational efficiencies.
Qualifications:
- 5 years of related experience in Cloud Operations.
- You have experience with Cloud Computing Platforms such as AWS and GCP.
- You have experience with Kubernetes and Docker.
- You have experience with one or more scripting languages, such as Python, Bash, etc.
- You have 2 years of experience in incident response and major incident management.
- You enjoy problem-solving and analyzing global-scale distributed systems.
- You are collaborative, with strong interpersonal and communication skills, both verbal and written.
- You remain calm and collected in stressful situations such as a major service outage.
- You demonstrate attention to detail, follow-through, and the ability to prioritize quickly.
- You demonstrate good judgment on when to solve problems individually and when to involve others.
- Experience in Infrastructure-as-Code (Terraform, Helm, YAML).
Nice to have:
- Experience handling SaaS applications for a large customer base.
- Experience with CI/CD frameworks and Pipeline-as-Code, such as Jenkins, GitLab, Artifactory, etc.
- Familiarity with microservices fundamentals, including Service Mesh using Istio, service discovery, deployment strategies, monitoring, scheduling, and load balancing.
We are an equal-opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
We value diversity at our company. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying.
Note:
Thank you for your interest in Splunk!