Lead Site Reliability Engineer

Phoenix, AZ - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Description

As a Lead Site Reliability Engineer here at Honeywell Aerospace Technologies, you will play a crucial role as a subject matter expert in ensuring the reliability, availability, and performance of our systems and services. You will work closely with development and operations teams to implement best practices in reliability engineering, automation, and monitoring, driving improvements across our infrastructure.

You will report directly to our SRE Engineering Manager, and you'll work out of our Phoenix, AZ location on a hybrid work schedule.

In this role, you will impact the efficiency and effectiveness of our operations by enhancing system reliability and performance, ultimately contributing to customer satisfaction and business success.

At Honeywell Aerospace Technologies, our people leaders play a critical role in developing and supporting our employees to help them perform at their best and drive change across the company. Help build a strong, diverse team by recruiting talent, identifying and developing successors, driving retention and engagement, and fostering an inclusive culture.

We are seeking a Site Reliability Engineer (SRE) with strong Database Administration (DBA) skills to ensure the reliability, performance, and scalability of our infrastructure and data platforms. You will work across engineering, operations, and data teams to build resilient systems, automate operations, and maintain mission-critical databases. You'll create standardized CI/CD frameworks that empower development teams while providing hands-on support to troubleshoot and resolve their build and deployment issues.

This role is ideal for someone who enjoys solving distributed-systems challenges while also diving deep into database internals, performance tuning, and data reliability.

Responsibilities

Key Job Responsibilities

Reliability Engineering
- Define and manage service SLOs/SLIs, track error budgets, and drive reliability roadmaps.
- Proactively identify reliability bottlenecks, and lead remediation and preventative actions.
- Establish CI/CD best practices and standards across the organization.
Observability & Telemetry
- Implement and scale metrics, logs, and traces across services (e.g., Prometheus/Grafana, OpenTelemetry, Dynatrace/Azure Monitor, ELK).
- Build actionable dashboards and alerts with noise reduction and runbooks for on-call.
Incident Management
- Own on-call rotations, triage, and coordination; drive post-incident reviews and blameless RCA with clear corrective actions.
- Automate rollback/roll-forward, health checks, and verification steps.
Performance & Capacity
- Conduct load and resilience testing; manage capacity planning and cost optimization (autoscaling, right-sizing, caching).
- Tune databases, queues, and network settings for throughput and latency.
Automation & Tooling
- Reduce toil with automation and self-service tooling; standardize deployment and recovery procedures.
- Build reliability guardrails (chaos experiments, circuit breakers, rate limiting, backoff).
Platform & Infrastructure
- Operate and harden Kubernetes clusters, container runtimes, and service meshes.
- Manage infrastructure using Infrastructure as Code (IaC) (Terraform/CloudFormation/Bicep), secrets management, and policy-as-code.
Security & Compliance
- Implement DevSecOps practices: vulnerability management, dependency scanning, Identity and Access Management (IAM) hardening.
Collaboration
- Partner with developers, QA, and product on design reviews, release strategies, and production readiness.
- Document standards and provide enablement sessions to elevate reliability practices.
- Create comprehensive documentation and self-service guides.
DB Activities
- Administer, maintain, and optimize relational and/or NoSQL databases.
- Collaborate with security teams to enforce database access controls.
- Create comprehensive documentation and self-service guides.

Tools & Technologies

Cloud: Azure
Containers & K8s: Docker, AKS/EKS, Helm, Istio
Observability: OpenTelemetry, Prometheus/Grafana, Azure Monitor/Log Analytics, Dynatrace, Elastic
CI/CD: GitHub Actions or Azure DevOps Pipelines; canary/blue-green deployments
IaC & Config: Terraform/Terragrunt, Bicep, Vault/Azure Key Vault, SSM
Security: Dependabot; Cosign; OPA

Qualifications

Required Qualifications

Education: Bachelor's degree from an accredited institution in a technical discipline such as science, technology, engineering, or mathematics.
Experience: 4-8 years in SRE/Platform/DevOps/Operations roles with ownership of production systems at scale.
Cloud: Hands-on with AWS/Azure/GCP (preferably two); strong grasp of managed services trade-offs.
Containers & Orchestration: Docker and Kubernetes (AKS/EKS/GKE); Helm/Kustomize; service mesh familiarity (Istio).
Observability: OpenTelemetry; metrics/logs/traces design; alerting strategies; RCA & postmortems.
Infrastructure as Code: Terraform (preferred) or cloud-native equivalents; modules, remote state, and CI integration.
Programming & Scripting: Proficiency in Python/Go and Bash for automation, tooling, and APIs.
Reliability Practices: SLO/error budgets, capacity planning, chaos/resilience testing, progressive delivery.
Soft Skills: Calm under pressure, strong communication, pragmatic decision-making, and a continuous improvement mindset.
Strong SQL skills and experience with at least one major database (PostgreSQL, MySQL, SQL Server, Oracle, MongoDB, etc.)
Deep knowledge of database internals, replication, and indexing
Understanding of networking fundamentals (DNS, load balancing, TCP/IP)
Expertise in designing and developing reusable CI/CD pipeline templates
Proficiency with at least two CI/CD platforms (Atlassian, Azure DevOps, Jenkins, GitHub Actions, GitLab CI)
Strong experience with Docker and Kubernetes
Infrastructure as Code skills (Terraform, ARM templates, or CloudFormation)
Cloud platform expertise (Azure, AWS, or GCP)
Experience troubleshooting build and deployment issues across multiple technology stacks
Strong Git and version control workflow knowledge
Experience with automated testing frameworks (.NET: xUnit/NUnit; Python: pytest)
Artifact and package management (NuGet, PyPI, Azure Artifacts, Artifactory)
Scripting skills (PowerShell, Bash, Python)

We Also Value:

Advanced degree in Computer Science, Engineering, or related field.
Experience with additional programming languages (Java, Go)
Knowledge of frontend frameworks (React, Angular)
GitOps implementation experience (ArgoCD, Flux)
Service mesh technologies (Istio, Linkerd)
Advanced deployment strategies (blue-green, canary, feature flags)
Database CI/CD and migration automation (Entity Framework, Flyway, Liquibase)
Security scanning tools integration (SonarQube, OWASP, Snyk)
Monitoring and observability tools (Prometheus, Grafana, ELK, Application Insights)
Configuration management (Ansible, Chef, Puppet)
Multi-cloud or hybrid cloud deployment experience
Experience building internal developer platforms
Creating CLI tools or IDE extensions for developer productivity
Policy-as-code implementation (OPA, Sentinel)
Cloud certifications (Azure, AWS, or GCP)
Kubernetes certifications (CKA, CKAD)
Experience with monorepo tools (Nx, Turborepo, Bazel)
API gateway and microservices architecture experience

Due to compliance with U.S. export control laws and regulations, candidate must be a U.S. Person, which is defined as a U.S. citizen, a U.S. permanent resident, or have protected status in the U.S. under asylum or refugee status, or have the ability to obtain an export authorization.

Benefits:

In addition to a competitive salary, leading-edge work, and developing solutions side-by-side with dedicated experts in their fields, Honeywell employees are eligible for a comprehensive benefits package. This package includes employer-subsidized Medical, Dental, Vision, and Life Insurance; Short-Term and Long-Term Disability; 401(k) match, Flexible Spending Accounts, Health Savings Accounts, EAP, and Educational Assistance; Parental Leave, Paid Time Off (for vacation, personal business, sick time, and parental leave), and 12 Paid Holidays. For more information visit: Benefits at Honeywell

Posting Timeline:

The application period for the job is estimated to be 40 days from the job posting date; however, this may be shortened or extended depending on business needs and the availability of qualified candidates.

#AERO26

About Company

Honeywell

Honeywell helps organizations solve the world's most complex challenges in automation, the future of aviation and energy transition. As a trusted partner, we provide actionable solutions and innovation through our Aerospace Technologies, Building Automation, Energy and Sustainability ...
