We are expanding our Site Reliability Engineering (SRE) team and seeking a highly skilled and passionate Senior SRE to join us. As a member of our growing SRE function, you will play a critical role in ensuring the reliability, scalability, and performance of the mission-critical services that power our customer experience. This is an exciting opportunity to shape our SRE practices, drive automation, and significantly impact our products' operational excellence.
What You'll Do:
- Drive Operational Excellence: Design, implement, and maintain highly available, scalable, and resilient systems that deliver an exceptional customer experience.
- Datadog Expert: Be one of the go-to experts for Datadog. You will be responsible for defining, implementing, and enforcing best practices for monitoring, alerting, logging, tracing, and synthetic testing across our entire AWS environment. This includes deep, hands-on configuration, dashboarding, troubleshooting, and optimization within Datadog.
- Software Development for Reliability: Develop robust, well-tested, and maintainable software and tooling to automate operational tasks, create self-service capabilities for engineering teams, and enhance system reliability. This will involve building applications, not just scripts.
- Toil Reduction Champion: Identify and eliminate toil through automation, process improvements, and systematic problem-solving. Work proactively to shift our operational focus from reactive firefighting to proactive engineering.
- Incident Management & Post-Mortems: Contribute to and evolve our incident response framework, participating in on-call rotations (using OpsGenie). Lead blameless post-mortems, extracting actionable insights and driving systemic improvements to prevent recurrence.
- Reliability Metrics & Goals: Collaborate with engineering teams to define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets. Use these metrics to drive continuous improvement and make data-driven decisions about reliability investments.
- Infrastructure as Code: Leverage and contribute to our infrastructure as code (IaC) efforts, moving towards a fully automated environment using Terraform and GitHub Actions.
- System Design & Architecture: Provide SRE expertise in system design reviews, influencing architectural decisions to build reliability, observability, and scalability into our services from the ground up.
- Knowledge Sharing & Mentorship: Document processes, build runbooks, and share your expertise with both the SRE team and the broader engineering organization. Help foster an SRE culture of shared ownership and continuous learning.
Qualifications:
What You'll Bring:
- 5 years of direct Site Reliability Engineering (SRE) experience, or equivalent experience in a production engineering role focused on system reliability.
- Deep expertise and hands-on experience with Datadog. Proven ability to implement, manage, and optimize Datadog for comprehensive monitoring (APM, infrastructure, logs, synthetics, RUM), alerting, and troubleshooting in complex cloud environments.
- Strong software development proficiency in Python (required). Demonstrated ability to build applications, tools, and automation frameworks beyond simple scripting.
- Experience with Golang (desired).
- Solid understanding of cloud-native architectures and best practices, specifically within AWS (EKS, Load Balancers, Aurora RDS Serverless Postgres, S3, Secrets Manager, MSK, Bedrock, SageMaker, Route53).
- Experience with containerization and orchestration technologies, particularly Kubernetes (EKS).
- Familiarity with CI/CD pipelines and tools (Jenkins, GitHub Actions).
- A strong understanding of distributed systems concepts, networking, and security principles.
- Experience with incident management processes and tools.
- Excellent problem-solving skills, with a methodical and data-driven approach to troubleshooting complex systems.
- Strong communication and collaboration skills, with the ability to work effectively with diverse engineering teams.
- A proactive mindset with a passion for automation, continuous improvement, and blameless culture.
Bonus Points (Nice to Have):
- Experience defining and working with SLOs, SLIs, and Error Budgets.
- Familiarity with other observability tools or concepts beyond Datadog.
- Experience with feature flagging platforms like LaunchDarkly.
Additional Information:
Why Join Us:
- Be a key member of a growing SRE team and help shape our operational future.
- Work on challenging problems at the intersection of software engineering, operations, and customer experience.
- Opportunity to significantly reduce toil and drive impactful automation.
- Collaborate with talented engineers in a supportive and learning-oriented environment.
- Your health and well-being are important to us. We provide programs that help you strike a healthy work-life balance.
- Opportunity to join a growing business launching into its next phase of expansion and transformation.
- Collaborative culture of smart and hard-working people who support one another to get the job done.
- An atmosphere of growth and opportunity where idea-sharing is always prioritized over level or hierarchy.
- Compensation packages based on experience and desired skill set.
About QAD and QAD Redzone:
QAD Inc. is a leading provider of adaptive, cloud-based enterprise software and services for global manufacturing companies. Global manufacturers face ever-increasing disruption caused by technology-driven innovation and changing consumer preferences. In order to survive and thrive, manufacturers must be able to innovate and change business models at unprecedented rates of speed. QAD calls these companies Adaptive Manufacturing Enterprises.
QAD Redzone helps to enable QAD's vision for the Adaptive Enterprise. Labor productivity improvements directly impact efficiency. Productive and empowered employees increase the effective capacity of your plant and accelerate time to productivity for new employees, giving manufacturers the agility to increase production beyond what was previously possible without having to invest in production equipment or new plants, and reduce the amount and impact of employee attrition. Empowered employees with a growth mindset take extreme ownership of challenges that impact their production goals, creating resilience in the face of disruption.
We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, and basis of disability, or any other federal, state, or local protected class.
Remote Work:
Yes
Employment Type:
Full-time