Site Reliability Engineer – Observability & Automation
Job Description:
About DXC
DXC helps global companies run their mission-critical systems and operations while modernizing IT, optimizing data architectures, and ensuring security and scalability across public, private, and hybrid clouds. The world's largest companies and public sector organizations trust DXC to deploy services to drive new levels of performance, competitiveness, and customer experience across their IT estates.
Our more than 125,000 people in 70-plus countries are entrusted by our customers to deliver transformative technologies to ensure the success, safety, and well-being of businesses and people around the world. By combining strengths and expertise globally, we create solutions and deliver greater outcomes for customers across their entire IT estate. Learn more about how we deliver excellence for our customers and colleagues at.
About the team
Our Site Reliability Engineering team ensures the reliability, scalability, performance, and availability of critical systems through deep observability, automation, and continuous engineering improvement. We act as the eyes and the hands of production: we instrument and monitor everything that matters, and we automate everything that shouldn't be done by hand. The team partners with development, operations, cloud infrastructure, networking, and security stakeholders to reduce toil, increase system visibility, accelerate incident response, and keep highly available cloud and on-prem environments healthy.
About the Role
As a Site Reliability Engineer, Observability & Automation, you will be responsible for designing end-to-end observability across our platforms and building the automation that keeps them stable and self-healing. You will own monitoring, logging, tracing, and alerting strategies using tools such as Dynatrace, the Elastic Stack, Zabbix, and Grafana, and you will turn manual operational work into repeatable, auditable automation with Ansible and AWX. You'll partner closely with development and infrastructure teams to make systems more transparent, more reliable, and less dependent on human intervention.
Essential Job Functions
- Design, implement, and maintain end-to-end observability (metrics, logs, traces, and alerting) across cloud and on-prem systems.
- Build and tune dashboards and visualizations in Grafana, correlating data from multiple sources for fast diagnosis.
- Implement APM, full-stack monitoring, and root-cause analysis with Dynatrace, including service-level and infrastructure-level visibility.
- Design and operate centralized log management and analytics with the Elastic Stack (Elasticsearch, Kibana, Logstash/Beats), including parsing, indexing, and retention strategies.
- Configure and maintain infrastructure and network monitoring with Zabbix, including templates, triggers, and intelligent alerting.
- Develop automation with Ansible and AWX/Tower: playbooks, roles, inventories, workflows, and scheduled jobs to eliminate repetitive operational tasks.
- Build automated incident detection and self-remediation flows that reduce mean-time-to-detect (MTTD) and mean-time-to-resolve (MTTR).
- Write scripts and tooling (Python, Bash, PowerShell) to integrate systems, enrich monitoring data, and extend automation.
- Define, instrument, and track SLOs, SLIs, and SLAs, and reduce alert noise by improving signal quality.
- Collaborate with development teams to improve application reliability, performance, and instrumentation, contributing code and reviewing changes when needed.
- Support cloud and network infrastructure to ensure high availability, capacity, and performance across environments.
- Participate in incident management, root cause analysis, and blameless post-mortems, turning findings into automation and monitoring improvements.
Basic Qualifications
- Bachelor's degree in Computer Science, Engineering, Information Technology, or a related field, or an equivalent combination of education and experience.
- 3-5 years of experience in Site Reliability Engineering, DevOps, Monitoring/Observability, Cloud, or related roles.
- Hands-on experience with observability and monitoring platforms, ideally including several of: Dynatrace, Elasticsearch / Elastic Stack, Zabbix, and Grafana.
- Proven experience building automation with Ansible (playbooks, roles, inventories) and orchestrating it through AWX or Ansible Tower/AAP.
- Strong scripting and programming skills in Python, Bash, PowerShell, or similar, with the ability to integrate APIs and build operational tooling.
- Software development knowledge (version control with Git, code review practices, and comfort reading/contributing to application code).
- Experience working with cloud platforms such as AWS, Azure, or Google Cloud Platform.
- Networking fundamentals: TCP/IP, DNS, HTTP/S, load balancing, firewalls, and basic troubleshooting of connectivity and latency issues.
- Solid understanding of system reliability practices, distributed systems, and SLO/SLI/SLA concepts.
- Advanced English level required for global collaboration.
Other Qualifications (Preferred / Nice to Have)
- Experience designing centralized logging and alerting at scale (log pipelines, retention, cost/performance tuning).
- Experience with containerization and orchestration (Docker, Kubernetes) and how to observe and automate them.
- Familiarity with Infrastructure as Code (Terraform, CloudFormation) complementing Ansible-based configuration management.
- Experience with CI/CD tooling (GitLab CI, GitHub Actions, Jenkins, or similar).
- Experience implementing automated incident response and self-healing systems.
- Exposure to security and compliance in cloud environments.
- Certifications in Dynatrace, Elastic, Red Hat (Ansible/RHCE), or a cloud provider (AWS/Azure/GCP) are a plus.
- Strong analytical thinking, troubleshooting, and a proactive mindset focused on automation, efficiency, and continuous improvement.
- Ability to work under pressure and handle critical incidents.
Why DXC / Life at DXC
DXC is an employer of choice with strong values and fosters a culture of inclusion, belonging, and corporate citizenship. We inspire and take care of our people. We work to create a culture of learning, diversity, and inclusion, and are dedicated to strong ethics and corporate citizenship. DXC is where brilliant people seize opportunities to advance their careers and amplify customer success.
Full-time hires are eligible to participate in the DXC benefit program. DXC offers a comprehensive, flexible, and competitive benefits program which includes, but is not limited to, health, dental, and vision insurance coverage; employee wellness; life and disability insurance; a retirement savings plan; paid holidays; and paid time off.
At DXC Technology, we believe strong connections and community are key to our success. Our work model prioritizes in-person collaboration while offering flexibility to support wellbeing, productivity, individual work styles, and life circumstances. We're committed to fostering an inclusive environment where everyone can thrive.
Recruitment fraud is a scheme in which fictitious job opportunities are offered to job seekers, typically through online services such as false websites, or through unsolicited emails claiming to be from the company. These emails may request recipients to provide personal information or to make payments as part of their illegitimate recruiting process. DXC does not make offers of employment via social media networks, and DXC never asks for any money or payments from applicants at any point in the recruitment process, nor does it ask a job seeker to purchase IT or other equipment on our behalf. More information on employment scams is available here.
Required Experience:
IC
About Company
Created by the merger of CSC and the Enterprise Services business of Hewlett Packard Enterprise, DXC Technology boasts a long and proud history of innovation, service and value. In 1959, computer analysts Roy Nutt and Fletcher Jones pooled $100 to form CSC, providing computer manufac ...