Randstad is seeking a highly experienced and technically proficient Senior DevOps and Site Reliability Engineer (SRE) to join our client in the DC Metro area. This critical senior-level role is responsible for driving the reliability, performance, security, and scalability of high-availability production environments on AWS. The ideal candidate is a hands-on technical leader who blends deep expertise in software development, infrastructure-as-code, and observability to automate operational toil, lead capacity planning, and serve as a primary on-call responder for critical incidents. This role demands a strong focus on applying SRE principles (SLIs/SLOs/Error Budgets), mentoring team members, and proactively influencing cross-functional teams to achieve world-class operational excellence.
Responsibilities
Deployment & Automation Engineering
- Implement, maintain, and optimize robust CI/CD pipelines utilizing tools such as GitHub Actions, AWS CodePipeline, and Jenkins.
- Automate infrastructure provisioning and configuration management using Infrastructure-as-Code (IaC) tools like Terraform, CloudFormation, or AWS CDK.
- Design and develop automation scripts and self-service tools to significantly enhance development and operational efficiency.
- Apply proficiency in multiple programming languages (Python, Go, Java) to develop automation and troubleshoot applications.
Site Reliability & Observability
- Serve as a production on-call responder, leading incident management and orchestrating the response to critical service outages and disaster recovery failover activities.
- Facilitate detailed post-mortem meetings and drive systemic improvement patterns across teams.
- Define, monitor, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
- Expertly leverage observability tools (Dynatrace, AppDynamics, ELK Stack; Dynatrace strongly preferred) for proactive monitoring and troubleshooting.
- Utilize distributed tracing and context propagation to identify performance bottlenecks and root causes of failures.
- Design and implement custom dashboards and anomaly detectors to generate actionable insights.
Capacity, Performance & Cost Management
- Develop sophisticated capacity models and forecasting systems to ensure service scalability.
- Lead cost optimization initiatives, identifying and implementing efficiency gains across cloud services.
- Design and execute comprehensive Resiliency and Performance testing frameworks.
- Configure and maintain dynamic auto-scaling policies and thresholds for optimal resource utilization.
Security & Governance
- Lead security incident investigations and execute swift remediation plans.
- Design and implement automated compliance validation and security automation frameworks.
- Drive the implementation of zero-trust architecture patterns within the cloud environment.
- Proficiently apply ITIL framework principles, preferably leveraging ITSM tools such as ServiceNow.
Qualifications
Education & Experience
- Bachelor's degree in Computer Science, Engineering, or a related technical field.
- 5 to 8 years of progressive experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering.
- 3 years of experience maintaining and optimizing high-availability production environments.
- Proven track record of leading complex technical initiatives from conception to completion.
Technical Expertise
- Expert-level knowledge of at least one major cloud platform, with AWS strongly preferred.
- Deep expertise in cloud architecture, networking, and core services.
- High proficiency in IaC tools such as Terraform, CloudFormation, or AWS CDK.
- Expert-level experience with observability and APM tools, with a strong preference for Dynatrace.
- Proficiency in modern programming languages like Python, Go, or Java.
- Knowledge of relational, cloud-native, and NoSQL database technologies.
Professional & Leadership Skills
- Strong leadership and mentoring capabilities, with the ability to elevate the technical skills of the team.
- Exceptional ability to influence without direct authority across engineering and product teams.
- Excellent technical writing and documentation skills (e.g., RCA development, Knowledge articles).
- Ability to maintain flexible availability for on-call duties and to work outside of standard business hours as required for incident response.
This is a high PRIORITY requisition. This is a PROACTIVE requisition.
Background Check : No
Drug Screen : No