Site Reliability Engineer (SRE)
Location: Monterrey, Nuevo León, Mexico
Language requirement: Fluent English (spoken and written)
About the Role
We're looking for a Site Reliability Engineer (SRE) with a passion for Application Performance Monitoring (APM) and system optimization.
In this role, you'll be at the heart of ensuring the reliability, scalability, and performance of mission-critical applications. You'll work closely with software engineering and operations teams to design monitoring strategies, analyze performance, and prevent issues before they affect users.
If you thrive in fast-paced environments, love solving complex technical challenges, and enjoy turning data into insight, this is the role for you.
What You'll Do
Design and manage APM strategies using tools like Elastic APM, Datadog, Dynatrace, or similar platforms.
Perform deep performance analysis, tracing distributed requests and identifying bottlenecks in both code and infrastructure.
Build real-time dashboards and alerting systems using Grafana, Kibana, or equivalent tools to visualize system health.
Proactively monitor systems to detect performance degradations, security threats, and system failures before users are impacted.
Define and track Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to continuously improve reliability.
Lead Root Cause Analysis (RCA) sessions after incidents and implement corrective actions to prevent recurrence.
Automate repetitive tasks and monitoring setups using Python, Bash, or PowerShell.
Collaborate with cross-functional teams to embed reliability, performance, and observability best practices into every stage of development.
Continuously refine tools, processes, and APM strategies to enhance efficiency, reliability, and visibility across platforms.
Engage with stakeholders to understand performance challenges and shape the platform roadmap.
What You Bring
Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
5 years of experience in Site Reliability, DevOps, or Performance Engineering roles.
Proven hands-on experience with APM tools such as Elastic APM, Datadog, Dynatrace, New Relic, or AppDynamics.
Expertise in the Elastic Stack (Elasticsearch, Logstash, Kibana, Beats) for logging, monitoring, and APM.
Deep understanding of SRE principles, DevOps methodologies, and Production Support operations.
Strong scripting ability in Python, Bash, or PowerShell for automation and analysis.
Solid grasp of Linux/Unix systems, networking fundamentals, and distributed system architecture.
Experience with containerization (Docker) and orchestration (Kubernetes).
Excellent analytical, problem-solving, and collaboration skills, with the ability to communicate effectively in a global team.
Preferred Skills
Fluent English (Mandatory)
Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or Chef.
Familiarity with cloud-native services (AWS, Azure, or GCP) and serverless architectures (AWS Lambda, Azure Functions).
Knowledge of CI/CD tools like GitHub Actions, Azure DevOps, or Jenkins.
Understanding of other observability pillars, including metrics (Prometheus) and logging.
Experience working in agile environments.
HQ Platinum was created as a division of Human Quality focused on headhunting services, designed to connect senior-level professionals with the specific positions our clients request.