We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to join our team. As an SRE, you will play a pivotal role in ensuring the reliability, scalability, and performance of our cloud-based infrastructure.
You will collaborate closely with development, operations, and other teams to implement and maintain efficient and resilient systems.
Responsibilities:
- Infrastructure Automation: Develop, deploy, and oversee Infrastructure as Code (IaC) solutions using tools such as Terraform and Ansible to automate provisioning, configuration, and deployment processes.
- Cloud Platform Expertise: Apply a deep understanding of AWS cloud services, including EC2, S3, VPC, RDS, EKS, ECS, CF, and more. Experience with serverless architecture and AWS Lambda functions is a plus.
- Containerization and Orchestration: Apply proficiency in containerization technologies (Docker) and orchestration platforms (Kubernetes), including deploying applications using tools like K8s and Helm.
- CI/CD Pipelines: Build and maintain robust CI/CD pipelines using tools like Jenkins.
- Monitoring and Alerting: Implement comprehensive monitoring and alerting solutions using tools like ELK, Datadog, CloudWatch, and Grafana to proactively identify and resolve issues.
- Incident Management: Drive incident response processes, troubleshoot complex issues, and perform Root Cause Analysis (RCA) to prevent future occurrences (CAPA).
- Performance Tuning: Continuously optimize system performance, identify bottlenecks, and implement strategies to improve scalability and efficiency.
- Cost Optimization: Identify and implement strategies to reduce cloud costs while maintaining performance and reliability.
- Security Best Practices: Adhere to security best practices and implement measures to protect infrastructure and data from vulnerabilities and threats.
- Collaboration and Communication: Work effectively with cross-functional teams to understand business requirements and provide technical guidance.
- SOP Documentation: Create and maintain documentation for infrastructure processes and incident management protocols.
#IL-MP01
Qualifications:
Required Skills and Experience:
- 7 years of experience as a DevOps Engineer or Site Reliability Engineer.
- Strong proficiency in AWS cloud services such as EC2, S3, VPC, RDS, EKS, ECS, CF, and more. AWS certification is a plus.
- 3 years of experience with serverless architectures using AWS Lambda.
- Strong scripting skills (Python, PowerShell, CDK, shell scripting).
- Knowledge of CDK (Cloud Development Kit) for infrastructure as code.
- Experience with infrastructure as code tools (Terraform, Ansible) and AWX Tower for Ansible automation.
- Knowledge of containerization (Docker) and orchestration platforms (Kubernetes).
- Expertise in CI/CD pipelines and automation tools (Jenkins, GitHub).
- Exposure to monitoring and alerting tools (CloudWatch, Datadog, ELK, Grafana, New Relic).
- Experience documenting SOPs and RCAs.
- Understanding of security best practices and compliance standards. Security certification is a plus.
Remote Work:
No
Employment Type:
Full-time