Position Summary:
The Reliability Engineering Automation team prides itself on keeping Visa systems up and secure, catering to the 24x7 needs of the business. The GenAI Senior Site Reliability Engineer is a highly motivated senior individual contributor based in Bengaluru, India, responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. The role calls for a senior technologist with a passion for solving problems by developing systems and software that increase site reliability and performance. Site reliability engineering (SRE) fuses the software engineering and operations disciplines in the GenAI ecosystem.
Responsibilities:
- System Reliability: Ensure the uptime, reliability, and scalability of GenAI platforms and services.
- Monitoring & Alerting: Design, implement, and improve monitoring, logging, and alerting for AI workloads and infrastructure.
- Incident Response: Respond to, investigate, and resolve production incidents, ensuring minimal disruption to GenAI services.
- Automation: Develop and maintain automation scripts for deployment, scaling, and recovery of GenAI systems.
- Performance Optimization: Analyze system bottlenecks and optimize resource utilization for AI model training and inference.
- Collaboration: Work closely with ML engineers, data scientists, DevOps, and platform teams to support end-to-end GenAI pipelines.
- Security & Compliance: Implement robust security practices and ensure compliance with relevant data and AI regulations.
- Documentation: Maintain clear documentation for processes, runbooks, and system architecture.
Required Skills:
- Kubernetes & Containers: Proficiency in Kubernetes, Docker, and related tools for orchestration of AI workloads.
- Infrastructure as Code: Skills in Terraform, Ansible, or similar.
- Monitoring & Logging: Familiarity with Prometheus, Grafana, the ELK stack, or similar tools.
- Scripting & Programming: Ability to write scripts (Python, Bash, Go, etc.) for automation and tooling.
- CI/CD Pipelines: Knowledge of CI/CD workflows, especially for ML/AI projects.
- AI/ML Workloads: Understanding of the ML model lifecycle, distributed training, and inference serving (e.g., using Ray, Kubeflow, or MLflow).
- Troubleshooting: Strong analytical and troubleshooting skills, especially in complex distributed environments.
This is a hybrid position. Expectation of days in office will be confirmed by your Hiring Manager.
Qualifications:
Bachelor's or master's degree in Computer Science, Engineering, or a related field.
- Professional Experience: 4 years as an SRE, DevOps Engineer, or similar, preferably supporting AI/ML or large-scale data platforms.
- AI/ML Infrastructure: Hands-on experience operating GPU clusters, AI frameworks (TensorFlow, PyTorch), and data pipelines is a plus.
- Incident Management: Demonstrated experience in high-severity incident response and postmortem analysis.
- Collaboration: Experience working in cross-functional teams, especially with AI/ML practitioners.
Additional Information:
Visa is an EEO Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, or protected veteran status. Visa will also consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law.
Remote Work:
No
Employment Type:
Full-time