Overview
We are seeking an experienced Site Reliability Engineer (SRE) to join a dynamic technology team supporting large-scale infrastructure and AML systems. This role combines software engineering, systems engineering, automation, and operational excellence to ensure high availability, scalability, and reliability across critical platforms.
The ideal candidate is passionate about infrastructure automation, system performance, cloud-native technologies, and operational reliability in fast-paced environments.
Key Responsibilities
Design, build, and maintain highly available, scalable, and fault-tolerant systems
Collaborate closely with software engineering teams to improve system reliability and performance
Develop and maintain automation tools and operational procedures to improve efficiency and reduce manual intervention
Monitor infrastructure and application performance to proactively identify and resolve issues
Implement and maintain monitoring, alerting, and observability solutions, including SLIs, SLOs, and SLAs
Participate in 24/7 on-call rotations, incident management, root-cause analysis, and blameless post-mortems
Ensure infrastructure security, compliance, and adherence to operational best practices
Support large-scale web traffic and machine learning data processing environments
Requirements
Technical Skills
Proficiency in at least one programming language, such as Python, Go, Java, or C
Strong scripting and automation skills
Good understanding of Linux operating systems and network architecture
Experience with Docker and Kubernetes
Hands-on experience with monitoring tools such as Prometheus and Grafana
Knowledge of relational databases and database modeling
Preferred Skills
Exposure to machine learning frameworks such as TensorFlow, PyTorch, MXNet, or PaddlePaddle
Strong analytical and problem-solving abilities
Excellent communication and collaboration skills
Ability to work effectively in a fast-paced and cross-functional environment
Qualifications
Bachelor's or Master's degree in Computer Science, Information Technology, Computer Engineering, or a related field
Minimum of 3 years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering
Why Join Us
Opportunity to work on large-scale distributed systems and modern infrastructure technologies
Exposure to cloud-native environments and advanced automation practices
Collaborative and technology-driven working environment
Career growth and continuous learning opportunities
Competitive salary and benefits package