Looking for a Site Reliability Engineer who can build, scale, maintain, and monitor highly available, secure, and cost-efficient cloud platforms and Kubernetes workloads, with a strong focus on reliability engineering practices (SLIs/SLOs, error budgets, incident response, postmortems). Own production readiness and operational excellence across infrastructure and delivery tooling. Ensure performance, uptime, and scalability while maintaining high standards of code quality and thoughtful design. Lead the transition and continuous improvement of applications and infrastructure toward resilient, automated, and observable systems.
Qualifications:
- Bachelor's degree in Computer Science, an equivalent degree, or equivalent professional experience.
- 3 years of experience working on distributed systems and cloud operations.
- Strong hands-on experience with at least two major cloud providers (Azure, AWS, GCP) and their managed Kubernetes services.
- Deep experience architecting and/or operating large Kubernetes clusters: workload identity, networking, storage, autoscaling, upgrades, security, and multi-tenancy.
- Container expertise (Docker/OCI), including packaging and configuration; service mesh experience is a plus.
- Advanced GitHub Actions expertise: reusable workflows/composites, concurrency/queueing, environments and approvals, OIDC federation, artifacts, caching, dependency review, and policy-as-code.
- Strong Python skills (required) for Pulumi-based IaC, tooling, and automation; Golang knowledge is a plus.
- Familiarity with CI/CD and change management, and experience with progressive delivery.
- Observability stack experience and alerting practices tied to SLOs.
- Experience configuring cloud-native networking, storage, Linux, security controls, and cost governance.
- Experience migrating and scaling infrastructure across clouds.
- Relevant certifications (e.g., CKA) are a plus.
- Advanced English (optional)
Responsibilities
- Design, build, and operate production-grade Kubernetes (AKS) clusters and supporting services with high availability, security, and cost optimization.
- Architect, implement, and maintain CI/CD using GitHub Actions (advanced), including reusable workflows, matrices, environments, required approvals, OIDC-based cloud auth, self-hosted runners, and policy controls.
- Define, codify, and evolve Infrastructure as Code with Pulumi (Python) as the primary stack; create reusable components and enforce code reviews, testing, and documentation.
- Develop and maintain configuration management with Ansible (roles, collections, inventories, playbooks) for OS, middleware, and app operations.
- Implement progressive delivery and deployment strategies (blue/green, canary, feature flags) and automate rollback/roll-forward based on health checks and SLOs.
- Establish comprehensive observability (metrics, logs, traces, profiles) with alerting tied to SLIs; drive capacity planning, performance tuning, and chaos/resiliency testing.
- Lead incident management and on-call response; coordinate triage, communication, mitigation, root-cause analysis, and follow-through on corrective actions.
- Partner with product and engineering to design for reliability (readiness/liveness probes, graceful shutdown, backpressure, retries/timeouts, circuit breakers).
- Implement security best practices (least privilege, secrets management) and ensure compliance with internal policies and audits.
- Continuously review existing systems, eliminate toil via automation, reduce technical debt, and document operational runbooks and standards.
Essential technologies and/or skills:
- Exceptional problem-solving with the ability to anticipate and remediate issues before they affect business productivity.
- Proven experience handling production environments and being available for emergencies.
- Clear, calm communication with technical and non-technical audiences.
- Passion for detail and a structured, methodical mindset in design, execution, and documentation.
- Professional, positive approach with strong ethics and high working morale.
- Curiosity to learn, a bias for automation, and a true can-do attitude.
- Cloud Platforms (Azure, AWS)
- Version control tools (Git/GitHub)
- Continuous Integration servers (GitHub Actions as primary)
- Configuration management (Ansible)
- Containers (Docker/OCI)
- Infrastructure Orchestration (Pulumi/Python)
- Monitoring and analytics (metrics/logs/traces, APM, alerting)
- Secrets management and security scanning/signing
- Incident management and on-call tooling
- Python (scripting level)
- MySQL
Additional Information:
What We Offer
- Competitive monthly net salary: $45,000 to $70,000 MXN.
- 100% remote work from anywhere in Mexico.
- Major Medical Insurance and healthcare coverage.
- Home office and ergonomics support (internet, electricity, office chair).
- Professional development opportunities, including English classes.
- Wellness benefits, such as TotalPass gym discounts.
- Savings plan.
- Paid time off, including personal days.
- A collaborative, international, and growth-oriented environment.
All your information will be kept confidential according to EEO guidelines.
Remote Work:
Yes
Employment Type:
Full-time