Role : Azure Cloud Engineer AI
No. of Positions : 1
Client : Bank of Montreal
Location : Toronto, Canada
Pay rate : CAD 90 (Inclusive of all)
Work Model : Hybrid (3 days in office)
Cloud Engineer, AI Infrastructure
Role Overview
As a Cloud Engineer, you will be responsible for implementing and maintaining scalable, secure, and high-performance cloud infrastructure to support AI/ML workloads. You'll work closely with platform, application, and data teams to ensure reliable operations and efficient delivery of AI services.
Key Responsibilities
Infrastructure & Platform Operations
- Deploy and manage cloud-native infrastructure for AI/ML workloads (GPU/CPU clusters, autoscaling, spot instances).
- Configure and maintain networking components (Azure VNet, Private Link, peering, HA/DR setups).
- Operate storage and database systems, including Azure Data Lake Storage, relational databases, and vector databases (FAISS, Milvus, Pinecone).
- Implement IAM policies, secrets management (Key Vault), and encryption standards.
Observability & Reliability
- Set up monitoring for latency, throughput, GPU utilization, and cost metrics.
- Integrate logging and tracing tools (OpenTelemetry) and maintain SLOs/SLIs for infrastructure services.
- Support incident response and root cause analysis using SRE principles.
CI/CD & Infrastructure Automation
- Build and maintain CI/CD pipelines using GitHub Actions or Azure DevOps.
- Implement GitOps workflows for infrastructure-as-code using Terraform or Bicep.
- Create reusable IaC modules and templates for consistent deployments.
FinOps & Cost Optimization
- Monitor and optimize GPU usage, caching strategies, and inference performance.
- Support cost governance and reporting for AI infrastructure.
Application Enablement
- Provide infrastructure support for APIs, microservices, and event-driven architectures.
- Enable model serving runtimes (TensorRT-LLM, vLLM, Triton/KServe).
- Support RAG pipelines, including embeddings, chunking, and retrieval systems.
Security & Compliance
- Apply defense-in-depth strategies: IAM least privilege, private networking, image signing.
- Ensure compliance with data residency, encryption, and audit requirements.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or related field.
- 3-5 years of experience in cloud infrastructure (Azure preferred).
- Hands-on experience with Kubernetes, Terraform/Bicep, and cloud networking.
- Familiarity with AI/ML infrastructure components and model serving.
- Proficiency in Python for automation; Go or TypeScript is a plus.
Tech Stack
- Cloud & Infra: Azure (AKS, Functions, Event Hubs, Key Vault), Terraform/Bicep, GitHub Actions
- AI Infra: Kubernetes, KServe/Triton, vLLM, TensorRT-LLM
- Ops: Prometheus, Grafana, OpenTelemetry, ArgoCD
- Data: Feature stores (Feast), vector DBs (FAISS, Milvus), relational DBs
- App Layer: APIs, microservices, frontend/backend integration
Success Metrics
- Reliability: SLOs met, uptime maintained
- Security: No critical vulnerabilities, audit-ready infrastructure
- Cost Efficiency: Optimized GPU and infra spend
- Velocity: Fast and reliable deployments
- Collaboration: Effective cross-team support and documentation