Role Name: Site Reliability Engineer - Lead
Cincinnati, OH - Hybrid (W2 only)
Role Description:
As a Site Reliability Engineer - Lead, you will drive the reliability, scalability, and performance of mission-critical systems and services while leading a team of SREs. This role combines deep technical expertise with leadership, mentoring, and strategic planning. You will set standards for operational excellence, guide incident response, and foster a culture of automation and continuous improvement. Collaboration with engineering, operations, and product teams is essential to align reliability initiatives with business objectives and ensure seamless service delivery.
REQUIRED SKILLS:
- Proven experience in site reliability, DevOps, or systems engineering, with prior leadership or team lead responsibilities
- Strong programming/scripting skills (e.g., Python, Go, Bash, or similar)
- Deep expertise in Linux/Unix system administration and networking
- Experience architecting and operating cloud platforms (AWS, Azure, GCP)
- Proficiency with infrastructure-as-code and automation tools (e.g., Terraform, Ansible, CloudFormation)
- Advanced knowledge of monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK, Datadog)
- Demonstrated incident management and root cause analysis skills
- Experience designing and implementing CI/CD pipelines
- Strong understanding of containerization and orchestration (Docker, Kubernetes)
- Ability to define and enforce reliability, scalability, and security best practices
- Excellent communication, stakeholder management, and collaboration skills
- Experience mentoring, coaching, and developing SRE or engineering teams
- Strong hands-on ability to define business process dashboards in APM tools such as Dynatrace, including SLA, SLO, and SLI definition, design, and implementation as part of observability
- Experience with devices such as scanners, POS devices, and peripheral devices (including devices with on-device memory)
- Experience with hardcoded device protocols and software, including the ability to decode and run them and help integrate them with other modules
- Experience in edge computing, Google Distributed Cloud, and hybrid cloud environments
- Experience leading SRE teams in high-growth or regulated environments
- Advanced database administration and optimization skills (both SQL, e.g., MySQL, and NoSQL, e.g., MongoDB)
Key Responsibilities:
- Team Leadership & Development:
- Bring technical expertise and hands-on experience, with the ability to lead the development team
- Mentor team members and guide them on the right approach for SRE-related work
- Foster a culture of operational excellence, automation, and continuous learning
- Conduct regular team meetings, 1:1s, and performance reviews
- Reliability Strategy & Architecture:
- Define and implement reliability, scalability, and performance strategies for critical systems
- Set standards for monitoring, alerting, and incident response
- Guide architectural decisions to ensure robust, resilient infrastructure
- Incident & Problem Management:
- Oversee incident response, root cause analysis, and post-mortem processes
- Coordinate with cross-functional teams to resolve complex issues and prevent recurrence
- Drive improvements based on incident learnings
- Process Improvement & Automation:
- Identify and eliminate manual operational tasks through automation
- Optimize CI/CD pipelines and deployment processes
- Continuously enhance system reliability and efficiency
- Stakeholder Collaboration:
- Partner with engineering, operations, and product teams to align reliability goals with business objectives
- Communicate reliability metrics, risks, and progress to leadership and stakeholders
- Ensure infrastructure and processes adhere to security best practices and compliance requirements
- Experience with chaos engineering and resilience testing