Senior DevOps and SRE Engineer

Washington, AR - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Randstad is seeking a highly experienced and technically proficient Senior DevOps and Site Reliability Engineer (SRE) to join our client in the DC Metro area. This critical senior-level role is responsible for driving the reliability, performance, security, and scalability of high-availability production environments on AWS. The ideal candidate is a hands-on technical leader who blends deep expertise in software development, infrastructure-as-code, and observability to automate operational toil, lead capacity planning, and serve as a primary on-call responder for critical incidents. This role demands a strong focus on applying SRE principles (SLIs, SLOs, and Error Budgets), mentoring team members, and proactively influencing cross-functional teams to achieve operational excellence.

Responsibilities

Deployment & Automation Engineering

Implement, maintain, and optimize robust CI/CD pipelines using tools such as GitHub Actions, AWS CodePipeline, and Jenkins.
Automate infrastructure provisioning and configuration management using Infrastructure-as-Code (IaC) tools like Terraform, CloudFormation, or AWS CDK.
Design and develop automation scripts and self-service tools to enhance development and operational efficiency.
Apply proficiency in multiple programming languages (Python, Go, Java) to develop automation and troubleshoot applications.

Site Reliability & Observability

Serve as a production on-call responder, leading incident management and orchestrating the response to critical service outages and disaster recovery failover activities.
Facilitate detailed post-mortem meetings and drive systemic improvements across teams.
Define, monitor, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
Leverage observability tools (Dynatrace, AppDynamics, ELK Stack; Dynatrace strongly preferred) for proactive monitoring and troubleshooting.
Utilize distributed tracing and context propagation to identify performance bottlenecks and root causes of failures.
Design and implement custom dashboards and anomaly detectors to generate actionable insights.

Capacity, Performance & Cost Management

Develop capacity models and forecasting systems to ensure service scalability.
Lead cost optimization initiatives, identifying and implementing efficiency gains across cloud services.
Design and execute comprehensive resiliency and performance testing frameworks.
Configure and maintain dynamic auto-scaling policies and thresholds for optimal resource utilization.

Security & Governance

Lead security incident investigations and execute swift remediation plans.
Design and implement automated compliance validation and security automation frameworks.
Drive the implementation of zero-trust architecture patterns within the cloud environment.
Apply ITIL framework principles, preferably leveraging ITSM tools such as ServiceNow.

Qualifications

Education & Experience

Bachelor's degree in Computer Science, Engineering, or a related technical field.
5 to 8 years of progressive experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering.
3 years of experience maintaining and optimizing high-availability production environments.
Proven track record of leading complex technical initiatives from conception to completion.

Technical Expertise

Expert-level knowledge of at least one major cloud platform, with AWS strongly preferred.
Deep expertise in cloud architecture, networking, and core services.
High proficiency in IaC tools such as Terraform, CloudFormation, or AWS CDK.
Expert-level experience with observability and APM tools, with a strong preference for Dynatrace.
Proficiency in modern programming languages like Python, Go, or Java.
Knowledge of relational, cloud-native, and NoSQL database technologies.

Professional & Leadership Skills

Strong leadership and mentoring capabilities, with the ability to elevate the technical skills of the team.
Exceptional ability to influence without direct authority across engineering and product teams.
Excellent technical writing and documentation skills (e.g., RCA development, knowledge articles).
Ability to maintain flexible availability for on-call duties and to work outside of standard business hours as required for incident response.

This is a high PRIORITY requisition. This is a PROACTIVE requisition.

Background Check : No

Drug Screen : No