Site Reliability Engineer


Job Location:

Alpharetta, GA - USA

Monthly Salary: Not Disclosed
Posted on: 10 days ago
Vacancies: 1 Vacancy

Job Summary

Site Reliability Engineer (Must come from Banking or Financial Clients)

Hybrid, 3x a week in Alpharetta, GA

Must be local to Georgia or the surrounding area, as the final round is an in-person interview.

Role Overview
The Site Reliability Engineer will support Cyber Data Risk & Resilience by ensuring the reliability, availability, performance, and operational visibility of critical cybersecurity platforms and services. This role is responsible for keeping production systems running, instrumenting infrastructure and application layers, building meaningful monitoring and actionable alerting, supporting incident response, and continuously improving dashboards used by engineering, operations, risk, and executive stakeholders.


Responsibilities
Maintain and improve the reliability, availability, scalability, and performance of cybersecurity platforms, services, and supporting infrastructure
Support day-to-day operational stability by monitoring system health, identifying risks, responding to incidents, and driving timely resolution of service-impacting issues
Instrument infrastructure, applications, services, APIs, data pipelines, and cloud components to provide end-to-end visibility into system behavior and service health
Design, build, and continuously refine monitoring, alerting, logging, tracing, and observability capabilities across distributed systems and cloud environments
Develop meaningful and actionable alerts that reduce noise, improve signal quality, and enable teams to respond quickly to emerging issues
Define and track key reliability metrics, including availability, latency, throughput, error rates, saturation, service-level indicators, service-level objectives, and operational risk indicators
Build, maintain, and enhance dashboards for engineering, operations, product, risk, and executive stakeholders, ensuring information is accurate, timely, and decision-ready
Continuously modify and improve executive dashboards to support regular leadership reviews of service health, reliability trends, incidents, risks, and operational performance
Partner with engineering, cybersecurity, infrastructure, cloud, and application teams to identify reliability gaps and implement long-term improvements
Participate in incident response, root-cause analysis, problem management, and post-incident reviews to prevent recurrence and improve operational maturity
Automate operational tasks, health checks, reporting, deployment validation, and recovery procedures to improve efficiency and reduce manual effort
Collaborate with application and platform teams to embed reliability, monitoring, and supportability requirements into the software development lifecycle
Support CI/CD, DevOps, and release management practices by validating operational readiness, monitoring coverage, rollback plans, and production support requirements
Contribute to resiliency engineering efforts, including capacity planning, performance tuning, failover validation, disaster recovery readiness, and chaos/resilience testing where applicable
Ensure monitoring, alerting, dashboards, and operational processes align with enterprise security, risk, compliance, and governance standards

Required Qualifications
7 to 10 years of experience in site reliability engineering, systems engineering, software engineering, DevOps, infrastructure engineering, or production operations
Strong experience supporting highly available, distributed, cloud-based, or mission-critical technology platforms
Hands-on experience with observability practices, including monitoring, alerting, logging, metrics, tracing, dashboards, and service health reporting
Experience instrumenting applications, services, APIs, infrastructure, databases, and cloud components to enable end-to-end operational visibility
Strong understanding of reliability engineering concepts, including SLIs, SLOs, SLAs, error budgets, incident management, capacity management, and operational readiness
Experience designing actionable alerts that support rapid issue detection, triage, escalation, and resolution
Experience building and maintaining operational dashboards for technical teams, support teams, and senior/executive stakeholders
Strong scripting or programming skills using Python, Java, Bash, PowerShell, or similar languages for automation and operational tooling
Experience with cloud platforms such as AWS, Azure, or GCP
Experience with Infrastructure-as-Code tools such as Terraform or similar technologies
Experience working with CI/CD pipelines, DevOps workflows, release processes, and production support models
Experience troubleshooting distributed systems, REST services, event-driven architectures, messaging platforms, and service-to-service integrations
Familiarity with relational and non-relational databases such as PostgreSQL, MSSQL, MongoDB, or similar platforms
Strong analytical, troubleshooting, and problem-solving skills, with the ability to diagnose complex technical issues across multiple layers of the stack
Strong written and verbal communication skills, including the ability to translate technical issues into clear business and executive-level updates

Preferred Skills
Experience supporting cybersecurity, risk, resilience, compliance, or enterprise security platforms
Experience with observability and monitoring tools such as Splunk, Grafana, Prometheus, Datadog, Dynatrace, New Relic, Azure Monitor, CloudWatch, OpenTelemetry, or similar platforms
Experience creating executive-level service health dashboards, reliability scorecards, operational risk reporting, or incident trend reporting
Experience developing automated health checks, synthetic monitoring, service dependency maps, and operational runbooks
Experience with incident response, major incident management, postmortems, root-cause analysis, and problem management practices
Experience with containerized and cloud-native environments, including Kubernetes, Docker, serverless services, or managed cloud platforms
Experience with distributed messaging or streaming platforms such as Apache Kafka
Familiarity with cloud-native security, governance, and policy tooling such as Azure Policy, AWS SCP, GCP constraints, or related controls
Familiarity with Cloud Security Posture Management tools such as Wiz, Prisma, CloudGuard, or similar platforms
Experience with cloud-based AI services such as Azure AI, AWS Bedrock, or Google Vertex AI, particularly from an operational monitoring, reliability, or governance perspective
Experience supporting Linux and Windows environments through scripting, automation, monitoring, and operational troubleshooting
Exposure to web technologies, APIs, front-end services, or user-facing application monitoring

Additional Skills
Strong ownership mindset with a focus on operational excellence and service reliability
Ability to operate effectively in fast-paced, production-focused environments with minimal supervision
Strong ability to prioritize issues based on customer impact, business risk, service criticality, and operational urgency
Effective collaboration skills across engineering, operations, cybersecurity, infrastructure, risk, and executive stakeholder groups
Ability to communicate service health, operational risks, incidents, and reliability trends clearly to both technical and non-technical audiences
Proactive and continuous-improvement mindset with a focus on automation, simplification, resilience, and measurable outcomes
Strong attention to detail when building dashboards, defining metrics, tuning alerts, and preparing executive-level operational reporting
