Description
Responsibilities:
- Design and implement solutions to enhance the reliability and scalability of AI/ML platforms and applications to accommodate fast-growing demands.
- Partner with product engineering teams to ensure the AI/ML systems are reliable and high performing.
- Develop observability, security, automation, and FinOps tools and orchestration.
- Provide strategic technology leadership by defining and evaluating standards and architecture for reliability, observability, and automation frameworks.
- Build strong cross-functional relationships that foster engagements across the organization and deliver solutions to user problems.
- Debug and solve issues in a production environment, identify root causes, and remediate them.
- Participate in on-call rotations, incident management, and escalation workflows.
- Take full ownership of problems, develop solutions, and acquire new knowledge to complete the task.
- Mentor and guide junior engineers.
Required Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or an equivalent technical qualification, with 5 years of professional experience.
- Expertise in SRE principles, reliability, scalability, and performance of applications and infrastructure.
- Hands-on experience with cloud platforms (AWS, GCP, Azure) and IaC tools (Terraform, Ansible).
- Extensive experience implementing advanced observability using tools like OpenTelemetry, Dynatrace, Grafana, and/or cloud-native services.
- Experience in architecting distributed systems and cloud-native architecture in AWS.
- Systematic problem-solving and troubleshooting skills in complex systems.
- Excellent communication skills and ability to represent and present business and technical concepts to stakeholders.
- Self-managed and self-motivated, with a strong sense of ownership, urgency, and drive.
Good to have:
- Prior experience working in AI, ML, or data engineering.
- Prior experience developing AI Ops/AI Agents.
- Multi-cloud experience (AWS, GCP, Azure) is a plus.
Required Experience:
IC