Principal Site Reliability Engineer

Atlanta, GA - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Who We Are

QGenda is redefining healthcare workforce management everywhere care is delivered. We're on a mission to empower the healthcare industry to better onboard, deploy, and manage their workforce. Over 4,500 healthcare organizations have trusted us to help them make strategic workforce decisions through our unified software platform. With more than 700 employees across the US, we are united in our vision and culture to make a difference for our customers while enjoying the day-to-day.

At QGenda, we value our employees and their contributions toward the success of the business. We strive to create a dynamic work environment that fosters growth, innovation, and collaboration, where employees can be proud of the work they do and the impact it has on the healthcare industry.

QGenda is headquartered in Atlanta.

To learn more about QGenda, visit our website or follow us on Instagram or LinkedIn.

About Your Role

As a Principal Site Reliability Engineer, you will work with our Infrastructure and Product Development Teams to design, operate, and scale highly available services. You will lead automation and infrastructure-as-code efforts to eliminate toil, standardize configuration, and expand observability across metrics, logs, and traces. You will evaluate and introduce AWS services and tooling that improve reliability, performance, and developer velocity. This role offers the opportunity to shape our reliability roadmap and make a measurable impact on the resilience and evolution of our technology stack.

How You'll Make an Impact

System Reliability and Performance:

Design, implement, and manage scalable systems that ensure high availability, fault tolerance, and optimal performance.
Continuously monitor and enhance system health and performance through data analysis and metrics.
Embed observability (metrics, logs, traces, alerts) with actionable thresholds and up-to-date runbooks.

Automation and Tooling:

Eliminate toil by building automation and self-service tools for common operational workflows.
Own CI/CD pipelines (build, test, security scans) and enable progressive delivery (blue/green, canary).
Manage infrastructure as code via Terraform and configuration management with Git-backed workflows.

Incident Management and Troubleshooting:

Participate in on-call; triage, mitigate, and resolve incidents within defined SLAs.
Lead incident response and blameless post-incident reviews; document RCAs and drive corrective actions to closure.
Maintain runbooks/playbooks and regularly perform disaster recovery scenarios.

Infrastructure Management:

Operate and secure AWS environments (IAM, VPC, EC2/ECS, RDS, S3, Lambda, etc.) with a focus on resilience and compliance.
Optimize cost, performance, and reliability (rightsizing, autoscaling, reservations/savings plans, tagging, spend monitoring, etc.).

Collaboration & Culture:

Serve as a technical advisor to engineering teams on infrastructure and operations best practices.
Mentor peers on SRE practices; promote observability, continuous improvement, and a blameless culture.
Contribute to roadmaps and capacity planning to align reliability goals with product objectives.

Who You Are

Availability for off-hours deployment and upgrades of production systems during release and maintenance windows. This is a rotational setup where you would be on two weeks at a time.
Strong problem-solving skills and ability to work effectively under pressure.
Excellent communication skills for cross-functional collaboration as well as documentation creation.

Experience You Bring

B.S. in Computer Science, Computer Information Systems, or Computer Engineering from a major U.S. university, or equivalent industry experience
8 years of experience as a DevOps, SRE, or Systems Engineer
Advanced proficiency with at least one scripting or programming language
Experience with Docker and container orchestration tools such as AWS ECS
Hands-on experience building infrastructure and supporting applications in AWS using services such as Lambda, EC2, ECS, S3, SNS, SQS, RDS, Redshift, and ElastiCache
Experience with logging and with creating dashboards and alerts using observability tools such as Datadog and Amazon CloudWatch
Strong understanding of networking and DNS
Familiarity with configuration management and infrastructure as code (IaC) tools such as Terraform
Firm understanding of and experience with Agile and Scrum SDLC processes
Experience using a distributed version control system (Git preferred) for code check-in, branching, merging, pull requests, code review, etc.
Knowledge of CI/CD best practices and tools such as AWS CodeBuild, Jenkins, and/or TeamCity
Experience designing and delivering secure, high-performance, and highly available cloud services

Not Required But Nice to Have

Experience with automation tools related to MLOps or AIOps such as AWS Bedrock and/or SageMaker.

#LI-Hybrid

Applicants for this position must be authorized to work for any employer in the United States (U.S.), including being located in the U.S. We are unable to sponsor, take over sponsorship of, or hire candidates with an employment visa at this time.

What's In It For You

We offer a comprehensive total rewards package to support our full-time employees and their families' day-to-day needs, well-being, and major life events, which includes:

Fully company-paid options for medical (both in-person and virtual), dental, and vision insurance
Generous paid time off (PTO) policy to enjoy periods of uninterrupted rest and relaxation for a healthy work/life balance
Paid parental leave for birth, adoption, or permanent placement
401(k) with company match
Options to work in a hybrid-working model or remotely from home, depending on the position
Annual Costco membership, cell phone stipend, commuter benefits, in-office perks, and more

QGenda delivers technology solutions to improve how healthcare is delivered and increase access - for everyone. We can only succeed by bringing together diverse minds, thoughts, ideas, and team members to create better solutions for our customers and make us a better company as a whole. We are committed to creating a culture of embracing diversity, inclusion, and equity for all.

QGenda is an Equal Employment Opportunity employer and makes all employment decisions without regard to race, color, religion, creed, gender, sex (including pregnancy), sexual orientation, gender identity or expression, national origin, ancestry, age, marital status, disability, genetic information, military status, status as a disabled or protected veteran, or any other protected status under applicable law.

If you require accommodations or assistance to complete the online application process, please contact us and identify the type of accommodation or assistance you are requesting. Do not include any medical or health information in this email. We will respond to your email promptly.

Required Experience:

Staff IC

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root Cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

QGenda

Healthcare Workforce Management software to boost engagement and retention, optimize staffing, and reduce labor costs.
