Principal Site Reliability Engineer

New York City, NY - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

This job posting is outdated and the position may be filled.

Job Summary

About SecurityScorecard:

SecurityScorecard is the global leader in cybersecurity ratings, with over 12 million companies continuously rated, operating in 64 countries. Founded in 2013 by security and risk experts Dr. Alex Yampolskiy and Sam Kassoumeh and funded by world-class investors, SecurityScorecard's patented rating technology is used by over 25,000 organizations for self-monitoring, third-party risk management, board reporting, and cyber insurance underwriting, making all organizations more resilient by allowing them to easily find and fix cybersecurity risks across their digital footprint.

Headquartered in New York City, our culture has been recognized by Inc. Magazine as a Best Workplace, by Crain's NY as one of the Best Places to Work in NYC, and as one of the 10 hottest SaaS startups in New York for two years in a row. Most recently, SecurityScorecard was named to Fast Company's annual list of the World's Most Innovative Companies for 2023 and to the Achievers 50 Most Engaged Workplaces in 2023 award, which recognizes forward-thinking employers for their unwavering commitment to employee engagement. SecurityScorecard is proud to be funded by world-class investors including Silver Lake Waterman, Moody's, Sequoia Capital, GV, and Riverwood Capital.

Role Overview: Principal Site Reliability Engineer, ML/AI Infrastructure

As a Principal Site Reliability Engineer (SRE) focused on ML and AI initiatives, you will play a critical role in designing and scaling the infrastructure that powers advanced machine learning workloads. You will lead the development of highly reliable, observable, and automated Kubernetes-based platforms that support model training, inference, and continuous delivery of ML applications. Working closely with ML engineers, data scientists, and platform teams, you will help operationalize machine learning workflows and bring cutting-edge AI capabilities into production with confidence and speed.

Key Responsibilities

Design and scale Kubernetes infrastructure purpose-built for ML/AI workloads, including GPU scheduling, autoscaling, and secure multi-tenant clusters.
Enhance CI/CD pipelines for ML applications and model delivery (MLOps), including support for reproducible training, model versioning, and shadow testing.
Implement progressive delivery strategies (e.g., canary releases, A/B testing) for machine learning models to ensure safe and incremental rollout of experiments.
Partner with ML teams to operationalize ML workflows with tools like MLflow, Kubeflow, or Vertex AI, and integrate these into the broader platform architecture.
Integrate and support Apache Kafka for streaming data ingestion and real-time feature delivery for ML pipelines.
Deploy and maintain Airflow pipelines for orchestrating complex ML workflows and data preparation tasks.
Build and optimize infrastructure and workflows for LangSmith/Langfuse to support observability and tracing of LLM-based applications and agents in production.
Lead improvements in Infrastructure as Code using Terraform, Helm, and Argo CD while establishing reusable and secure infrastructure patterns for AI applications.
Support YugabyteDB as a high-performance distributed database backend for ML and AI services requiring strong consistency and scale.
Define and enforce automated testing strategies tailored to ML environments, such as data validation, model performance regression, and pipeline integration tests.
Drive observability and alerting across ML pipelines and services, including monitoring data drift, model latency, and system-level metrics using tools like Prometheus, OpenTelemetry, New Relic, and Datadog.
Actively support incident response for ML systems and infrastructure, focusing on root cause analysis and resilient remediation strategies.
Mentor engineers and champion best practices across ML platform and infrastructure teams.

Qualifications

6 years in SRE, DevOps, or infrastructure roles, including 2 years supporting machine learning or data-intensive workloads in production
Deep experience running Kubernetes in production, especially with ML workloads (GPU scheduling, autoscaling, pod optimization)
Proven track record building CI/CD pipelines for ML systems using tools like GitHub Actions, GitLab CI, or Jenkins
Strong command of cloud-native infrastructure (EKS, GKE, or AKS), including GPU provisioning and autoscaling for AI workloads
Familiarity with MLOps and workflow orchestration tools such as MLflow, Kubeflow, Airflow, and Argo Workflows
Proficiency in Infrastructure as Code and GitOps with Terraform, Helm, and Argo CD
Experience managing event streaming (Kafka), distributed databases (YugabyteDB), and LLM observability (LangSmith/Langfuse)
Programming/scripting ability in Python, Go, or Bash for automating infrastructure or ML pipelines
Solid knowledge of monitoring and observability tools (Prometheus, OpenTelemetry, New Relic, Datadog)
Strong communication and mentoring skills; able to influence cross-functional teams

Required Experience:

Staff IC
