Requirements:
- Experience: 5 years
- Strong experience in DevOps or Site Reliability Engineering (SRE) roles.
- Strong knowledge of Docker, Kubernetes, Terraform, and CI/CD pipelines.
- Hands-on experience with AWS, Azure, or other cloud platforms.
- Familiarity with GPU infrastructure and ML workloads is a plus.
- Good understanding of monitoring and logging systems (Prometheus, Grafana).
- Ability to collaborate with ML teams for optimized inference and deployment.
- Strong troubleshooting and problem-solving skills in high-scale environments.
- Knowledge of infrastructure security best practices, cost optimization, and performance tuning.
- Exposure to vector databases and AI/ML deployment pipelines is highly desirable.
Responsibilities:
- Maintain and manage Kubernetes clusters, AWS/Azure environments, and GPU infrastructure for high-performance workloads.
- Design and implement CI/CD pipelines for seamless deployments and faster release cycles.
- Set up and maintain monitoring and logging systems using Prometheus and Grafana to ensure system health and reliability.
- Support vector database scaling and model deployment for AI/ML workloads.
- Collaborate with ML engineering teams to optimize inference performance and resource utilization.
- Ensure high availability, security, and scalability of infrastructure across multiple environments.
- Automate infrastructure provisioning and configuration using Terraform and other IaC tools.
- Troubleshoot production issues and implement proactive measures to prevent downtime.
- Continuously improve deployment processes and infrastructure reliability through automation and best practices.
- Participate in architecture reviews, capacity planning, and disaster recovery strategies.
- Drive cost optimization initiatives for cloud resources and GPU utilization.
- Stay updated with emerging technologies in cloud-native AI infrastructure and DevOps automation.
Qualifications:
Bachelor's or master's degree in Computer Science, Information Technology, or a related field.
Remote Work:
No
Employment Type:
Full-time