Senior or Lead SRE with Java and SQL Programming Changes and Automation Mindset
Irving, TX - USA
Job Summary
The Sr./Lead Site Reliability Engineer designs, enhances, and operates highly reliable, scalable, and observable production systems in an Azure-based environment. This role blends software engineering with systems administration to build resilient infrastructure, automate operations, and improve system performance. The engineer applies strong engineering principles to operational challenges, with a focus on reliability, automation, observability, and continuous improvement.
Core responsibilities include engineering-led incident response, implementing permanent corrective actions, reducing operational toil, and proactively preventing failures. The role contributes to code fixes, owns Dynatrace-based observability, and delivers custom reliability and operational reporting to improve system health and availability. Participation in a scheduled on-call rotation is required.
Minimum Requirement
4-year Computer Science, Information Systems, or Engineering degree, or relevant experience. (Degree, university, and year must be on the resume.)
8 years of site reliability experience.
Advanced SRE Leadership Responsibilities:
Provide technical leadership for SRE practices across multiple services or platforms.
Define and evolve reliability standards, operational best practices, and incident response frameworks.
Influence system architecture and design decisions to ensure scalability, resilience, and operability.
Serve as a subject matter expert for reliability, availability, and production risk management.
Act as the lead escalation point for complex and business-critical production incidents.
Lead high-severity incident response, coordinating across engineering, platform, and security teams.
Drive blameless post-incident reviews and ensure corrective actions are prioritized and completed.
Improve on-call processes, escalation models, and incident response effectiveness.
Own the strategy and implementation of Dynatrace-based observability, including dashboards and alerting standards.
Establish and monitor reliability signals (availability, latency, error rates) across critical systems.
Identify reliability risks and lead mitigation initiatives before customer impact occurs.
Define and maintain leadership-level reliability and operational reporting.
Use production data to drive prioritization of reliability investments and operational improvements.
Communicate reliability posture, risks, and recommendations to senior engineering leadership.
Mentor and guide senior and mid-level SREs and production support engineers.
Support hiring, onboarding, and technical evaluation of SRE talent.
Collaborate with squad members to define iteration plans and commitments.
Ensure compliance with HIPAA and other security regulations.
Critical Skills:
Strong experience with monitoring and observability tools (Dynatrace experience is a plus).
Hands-on experience with GitHub Actions for CI/CD automation.
Proficiency in Kubernetes and Docker for container orchestration.
Familiarity with Azure cloud services.
Experience with Ansible.
Demonstrated experience in automation of infrastructure and operational processes using scripting or configuration management tools.
Java application changes (fixing production bugs; adding resiliency, error handling, or safeguards).
SQL / database changes (schema updates or migrations; indexing or query optimization; rolling changes out safely in production).
Knowledge of SRE principles (SLIs, SLOs, error budgets).
Automate repetitive operational work using Ansible, Python, Bash, or similar tools.