Site Reliability Engineer

Cooper-Standard Automotive

Job Location:

Northville, NY - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Job Description:

Site Reliability Engineer (SRE)

About Liveline

Liveline enables dramatic improvements in manufacturing performance through a unique application of artificial intelligence to provide real-time process control and predictive assistants for plant personnel. Our focus is on automating complex processes, not simply providing dashboards for managers and operators.

Our team combines experts in AI with world-class process engineers who can focus on the last mile with customers: extracting data from the process and implementing controls on the shop floor. We speak the language of AI and of industrial controllers.

Our hardware and software offerings are scalable and cost-effective whether customers have one production line or hundreds, delivering an ROI that's attractive to small and medium-sized enterprises.

We are passionate about democratizing the power of analytics and advanced automation for manufacturers of almost any size. Through our approach, producers can demystify complex processes and free up valuable technicians to focus on more advanced tasks instead of constantly monitoring and adjusting equipment parameters.

A Liveline Technologies SRE is responsible for the reliability, performance, observability, and operational excellence of Liveline's production services, from the factory-floor edge systems to AWS cloud components. You will help build and run resilient infrastructure, automate repetitive work with code (Terraform, Bash, Python), implement monitoring and alerting (Prometheus/Grafana), and participate in incident response and on-call to ensure uptime for mission-critical manufacturing systems. You'll collaborate closely with controls engineers, data scientists, and software teams to safely deploy changes, define SLIs/SLOs, and continuously improve availability and latency for real-time process control.

Primary Responsibilities

Operate Production Systems: Maintain high availability, performance, and security of Liveline's production stack across AWS and plant/edge environments.
Observability & Monitoring: Stand up, tune, and maintain Prometheus/Grafana dashboards, alerts, recording rules, and runbooks. Implement logs/traces (e.g., OpenTelemetry) and actionable alerting.
Infrastructure as Code: Build and manage reproducible infrastructure with Terraform (VPC, IAM, EC2/EKS/ECS, RDS, S3, CloudWatch, CloudTrail). Apply version control, code reviews, and plan/apply workflows.
Automation & Tooling: Write Bash and Python scripts and small services to automate operational tasks, health checks, failover routines, backup/restore, and environment bootstrapping.
NOC / Incident Response: Participate in a follow-the-sun/on-call rotation; triage and resolve incidents, lead initial comms, and produce blameless postmortems with clear corrective actions.
SLIs/SLOs/Error Budgets: Define and instrument SLIs (availability, latency, error rate, freshness), set SLOs with stakeholders, and manage error budgets to guide release velocity and reliability tradeoffs.
Networking & Connectivity: Support secure, reliable connectivity between factory networks and cloud (site-to-site VPNs, routing, DNS, TLS, private subnets, security groups, network ACLs).
Databases & Storage: Operate and tune PostgreSQL/TimescaleDB, InfluxDB, or similar time-series/relational stores; manage backups, PITR, replication, partitioning, and performance baselining.
CI/CD & Release Engineering: Contribute to build/deploy pipelines (e.g., GitHub Actions/GitLab CI), implement canary/blue-green strategies, and enforce change management and rollback plans.
Security & Compliance: Enforce least-privilege IAM, secret management (AWS Secrets Manager/SSM), encryption, artifact signing, and basic hardening for Linux and Kubernetes workloads.
Edge & OT Collaboration: Partner with process/controls engineers to ensure reliable data ingestion from PLCs/industrial gateways (e.g., OPC UA/Modbus) and safe deploys to plant edge nodes.
Cost, Capacity & Performance: Right-size compute/storage, set budgets/alerts, forecast capacity, and optimize resource utilization without compromising SLOs.
Documentation & Runbooks: Author and maintain runbooks, architecture diagrams, operational playbooks, and disaster recovery procedures.

Education and Qualifications:

Bachelor's degree in IT, Computer Science, or Computer Engineering (or equivalent experience).
5 years of experience in a corporate IT or startup setting.
Familiarity with containers (Docker) and orchestration (Kubernetes or ECS).
Experience running production workloads, participating in on-call, and writing postmortems.
Strong communication skills, with the ability to explain tradeoffs to non-SRE stakeholders.
Intellectual curiosity, ownership mindset, and bias for automation.
Willingness and ability to travel to customer sites and plants as necessary.

Nice to Have

Kubernetes (EKS), Helm, Kustomize.
Service Mesh/Ingress (Envoy, NGINX, ALB).
Logging/Tracing: OpenSearch/ELK, Loki, OpenTelemetry.
Config Management: Ansible.
Secrets & PKI: HashiCorp Vault, mTLS.
Edge/Industrial Protocols: OPC UA, Modbus, MQTT; experience with industrial gateways.
Compliance exposure (SOC 2, ISO 27001) and change management (ITIL).

Position Type:

Regular

Additional Locations:

Additional Information:

Cooper Standard is proud of its diverse workforce and committed to providing equal employment opportunities to applicants and employees without regard to race, color, religion, sex, national origin, genetic information, physical or mental disability, age, veteran or military status, or any other characteristic protected by applicable law. We are dedicated to creating an environment at work that not only values diversity but also encourages inclusion and a sense of belonging. We firmly believe that a diverse workplace fosters an environment where our employees can flourish and provide superior service to our customers. Because we recognize and value the range of ways in which people acquire experiences, whether personal, professional, or via education or volunteerism, we invite interested applicants to evaluate the key duties and requirements and apply for any opportunities that fit your experience and qualifications. Applicants with disabilities may be entitled to reasonable accommodations under the Americans with Disabilities Act, as well as certain state and/or local laws. If you believe you require such assistance to complete our online application or to participate in an interview, you (or someone on your behalf) may request assistance by emailing with a description of the accommodation you seek. Application materials submitted to this email address will not be considered.

Remote Status:

Remote

About Company

Cooper-Standard Automotive

OUR INNOVATION. YOUR ADVANTAGE. Cooper Standard is a leading materials science and manufacturing expert headquartered in Northville, Mich., USA. We operate in 21 countries with a global team of 25,000 employees. We develop materials, systems and components for a wide range of diver ...
