Site Reliability Engineer

QAD, Inc.

Posted on : 13-06-2025

Job Location

Mexico City - Mexico

Monthly Salary

Not Disclosed

Vacancy

1 Vacancy

Job Description

We are expanding our Site Reliability Engineering (SRE) team and seeking a highly skilled and passionate Senior SRE to join us. As a member of our growing SRE function, you will play a critical role in ensuring the reliability, scalability, and performance of the mission-critical services that power our customer experience. This is an exciting opportunity to shape our SRE practices, drive automation, and significantly impact our product's operational excellence.

What You'll Do:

Drive Operational Excellence: Design, implement, and maintain highly available, scalable, and resilient systems that deliver exceptional customer experience.
Datadog Expert: Be one of the go-to experts for Datadog. You will be responsible for defining, implementing, and enforcing best practices for monitoring, alerting, logging, tracing, and synthetic testing across our entire AWS environment. This includes deep hands-on configuration, dashboarding, troubleshooting, and optimization within Datadog.
Software Development for Reliability: Develop robust, well-tested, and maintainable software and tooling to automate operational tasks, create self-service capabilities for engineering teams, and enhance system reliability. This will involve building applications, not just scripts.
Toil Reduction Champion: Identify and eliminate toil through automation, process improvements, and systematic problem-solving. Work proactively to shift our operational focus from reactive firefighting to proactive engineering.
Incident Management & Post-Mortems: Contribute to and evolve our incident response framework, participating in on-call rotations (using OpsGenie). Lead blameless post-mortems, extracting actionable insights and driving systemic improvements to prevent recurrence.
Reliability Metrics & Goals: Collaborate with engineering teams to define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets. Use these metrics to drive continuous improvement and make data-driven decisions about reliability investments.
Infrastructure as Code: Leverage and contribute to our infrastructure as code (IaC) efforts, moving towards a fully automated environment using Terraform and GitHub Actions.
System Design & Architecture: Provide SRE expertise in system design reviews, influencing architectural decisions to build reliability, observability, and scalability into our services from the ground up.
Knowledge Sharing & Mentorship: Document processes, build runbooks, and share your expertise with both the SRE team and the broader engineering organization. Help foster an SRE culture of shared ownership and continuous learning.

Qualifications:

What You'll Bring:

5 years of direct Site Reliability Engineering (SRE) experience, or equivalent experience in a production engineering role focused on system reliability.
Deep expertise and hands-on experience with Datadog. Proven ability to implement, manage, and optimize Datadog for comprehensive monitoring (APM, infrastructure, logs, synthetics, RUM), alerting, and troubleshooting in complex cloud environments.
Strong software development proficiency in Python (required). Demonstrated ability to build applications, tools, and automation frameworks beyond simple scripting.
Experience with Golang (desired).
Solid understanding of cloud-native architectures and best practices, specifically within AWS (EKS, Load Balancers, Aurora RDS Serverless Postgres, S3, Secrets Manager, MSK, Bedrock, SageMaker, Route53).
Experience with containerization and orchestration technologies, particularly Kubernetes (EKS).
Familiarity with CI/CD pipelines and tools (Jenkins, GitHub Actions).
A strong understanding of distributed systems concepts, networking, and security principles.
Experience with incident management processes and tools.
Excellent problem-solving skills, with a methodical and data-driven approach to troubleshooting complex systems.
Strong communication and collaboration skills, with the ability to work effectively with diverse engineering teams.
A proactive mindset with a passion for automation, continuous improvement, and blameless culture.

Bonus Points (Nice to Have):

Experience defining and working with SLOs, SLIs, and Error Budgets.
Familiarity with other observability tools or concepts beyond Datadog.
Experience with feature flagging platforms like LaunchDarkly.

Additional Information:

Why Join Us

Be a key member of a growing SRE team and help shape our operational future.
Work on challenging problems at the intersection of software engineering, operations, and customer experience.
Opportunity to significantly reduce toil and drive impactful automation.
Collaborate with talented engineers in a supportive and learning-oriented environment.
Your health and well-being are important to us. We provide programs that help you strike a healthy work-life balance.
Opportunity to join a growing business launching into its next phase of expansion and transformation.
Collaborative culture of smart and hard-working people who support one another to get the job done.
An atmosphere of growth and opportunity where idea-sharing is always prioritized over level or hierarchy.
Compensation packages based on experience and desired skill set.

About QAD and QAD Redzone:

QAD Inc. is a leading provider of adaptive, cloud-based enterprise software and services for global manufacturing companies. Global manufacturers face ever-increasing disruption caused by technology-driven innovation and changing consumer preferences. To survive and thrive, manufacturers must be able to innovate and change business models at unprecedented speed. QAD calls these companies Adaptive Manufacturing Enterprises.

QAD Redzone helps to enable QAD's vision for the Adaptive Enterprise. Labor productivity improvements directly impact efficiency. Productive and empowered employees increase the effective capacity of your plant and accelerate time to productivity for new employees, giving manufacturers the agility to increase production beyond what was previously possible without having to invest in production equipment or new plants, and to reduce the amount and impact of employee attrition. Empowered employees with a growth mindset take extreme ownership of challenges that impact their production goals, creating resilience in the face of disruption.

We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, and basis of disability, or any other federal, state, or local protected class.

Remote Work:

Yes

Employment Type:

Full-time

Department / Functional Area

Engineering
