SRE for Gen AI App Infrastructure and Operations

Job Location: Montreal, Canada

Monthly Salary: CAD 10 - 10
Experience Required: 5 years
Posted on: 30+ days ago
Vacancies: 1 Vacancy

Job Summary

"AI Infra Ops and SRE engineer

On-site requirement: 3 days a week in the office

Skills:
Production experience in SRE, infrastructure, or operations for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience with monitoring, observability, logging, and alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation


Experience: 8 years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and with networking and systems engineering knowledge.

Roles and Responsibilities:
Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
Design and build automation for core platform capabilities, reducing manual toil
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
Optimize cost vs. performance tradeoffs in large-scale compute environments
Harden systems for security, compliance, auditability, and data governance
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
Maintain runbooks, operational playbooks, documentation, and training materials
Participate in on-call rotations and respond to production incidents 24/7 as needed
Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability"



Required Skills:

Experience (Years): 4-6

Essential Skills:
Work with project teams throughout the organization to design, implement, and manage CDN infrastructure using Akamai to ensure high availability, performance, and scalability for customer-facing applications and business processes.
Handle multiple priorities and assignments with excellence and precision.
Be a part of a 24/7/365 organization (some after-hours support is expected as part of the normal on-call rotation).
Directly support line-of-business development teams and provide guidance to them on implementation and changes for customer-facing applications.
Develop and maintain security protocols and measures to protect CDN infrastructure from cyber threats.
Monitor and analyze network performance, identifying and resolving issues to optimize content delivery of critical applications.
Collaborate with cross-functional teams to integrate Akamai CDN solutions with existing systems and applications.
Collaborate with information security teams to implement DDoS protection strategies and other security measures in the CDN.
Provide technical support and guidance to clients and internal teams regarding CDN and security best practices.
Work closely with vendor and professional service teams on delivery-related activities and strategy.

Qualifications:
Bachelor's degree in Computer Science, Information Technology, or a related field, or similar work experience.
Strong understanding of network protocols (HTTP/HTTPS, DNS, TCP/IP).
Proven experience as a CDN Engineer or in a similar role, with in-depth knowledge of Content Delivery Network technologies, including caching, load balancing, and content optimization.
Excellent problem-solving skills and attention to detail.
Strong communication and teamwork abilities.
Experience supporting 24/7/365 customer-facing applications at enterprise scale.
Awareness of and experience with cybersecurity tools and practices such as firewalls, intrusion detection/prevention systems, and encryption.
Proficiency in scripting and automation (e.g., Python, Bash) is a plus.
Relevant certifications (e.g., CISSP, CEH) are a plus but not required.

Company Industry

IT Services and IT Consulting

Key Skills

  • Ruby
  • Disaster Recovery
  • Active Directory
  • SOA
  • Cloud
  • IaaS
  • PowerShell
  • AWS
  • Infrastructure
  • Linux
  • VPN
  • Hyper-V
  • VM
  • IP
  • Identity