- Role: GenAI SRE DevOps (L3) support
- Position: Onsite
- Location: New York
Key responsibilities
- Provide Technical Support: Offer expert-level support for GenAI products and services, including troubleshooting, debugging, and resolving technical issues.
- Troubleshooting and Problem Resolution: Identify, diagnose, and resolve complex technical problems involving AI models, algorithms, prompts, and software.
- End User Interaction: Work directly with end users to understand their needs, gather requirements, and provide solutions for GenAI-related challenges.
- Platform Enhancement and Optimization: Support and maintain a scalable, fault-tolerant AI/ML platform; monitor and troubleshoot data pipeline and model inference performance; and optimize AI models for real-time applications.
- Collaboration: Collaborate closely with engineering, DevOps, and product teams to improve product features, functionality, and reliability based on customer feedback, and ensure that improvements are deployed seamlessly.
- Documentation and Training: Create and maintain detailed documentation of support processes, issues, and resolutions for internal and external stakeholders. Assist in training customers and internal teams on best practices for using and maintaining GenAI products.
- Monitoring and Reporting: Monitor system performance, perform regular health checks, and generate reports on system status, incidents, and resolutions.
- Continuous Improvement: Stay up to date with the latest advancements in AI/ML/GenAI technologies and contribute to the continuous improvement of support processes and tools.
Required skills and qualifications
- Experience: Proven experience in technical support or engineering, preferably in AI/ML/GenAI environments.
- Technical Proficiency: Expertise in GenAI models (e.g., GPT, Claude, PaLM 2, Llama 2) and frameworks (e.g., RAG, Agents, CoT).
- Cloud Platform and DevOps: Hands-on experience with cloud platforms (Azure, AWS, GCP) and DevOps tools.
- Database Knowledge: Experience with SQL, Sybase, and MongoDB, plus any data warehouse (Snowflake, Databricks).
- Scripting and Automation: Strong proficiency in Python and shell scripting, plus other relevant programming and UI technologies such as Java and Angular, and familiarity with the HTTP protocol.
- Monitoring Tools: Knowledge of tools such as Splunk, AppDynamics, Autosys, Grafana, Loki, and Prometheus.
- ITIL Application Support Management Processes: Incident, problem, service, and Jira management.
- Kubernetes and Containerization: Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes (preferably EKS or OpenShift).
- Problem-solving and Analytical Skills: Excellent problem-solving, analytical, and troubleshooting skills with strong attention to detail.
- Communication and Collaboration: Strong command-and-control skills, with good communication and interpersonal skills to collaborate effectively with diverse teams and stakeholders across the globe.
- Educational Background: Bachelor's degree in Computer Science, Engineering, or a related field.
Preferred qualifications
- Experience with natural language processing (NLP) and machine learning (ML) models.
- Familiarity with large language models (LLMs) such as GPT-3.5 Turbo, GPT-4, and GPT-4o.
- Experience with OpenAI technologies and managing GenAI services in cloud environments.
- Understanding of MLOps practices and model lifecycle management.
- Familiarity with application monitoring solutions like Dynatrace and Splunk.
Note: Momento USA is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, pregnancy, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.