Site Reliability Engineer

TalentOla

Job Location:

Columbus, NE - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Summary:

As a Site Reliability Engineer (SRE) Level II, you will play a key role in maintaining the availability, scalability, and performance of critical infrastructure and services. You will be responsible for building and automating solutions that enhance system reliability and support continuous [...]. In this role, you will handle more complex operational tasks and incidents, provide mentorship to junior SREs, and collaborate with development teams to ensure systems are designed for reliability from the ground up.
Incident Management:
[...] complex incidents and ensure service uptime.
Lead troubleshooting efforts for high-impact production issues, providing detailed root cause analysis (RCA) and preventative measures.
Participate in on-call rotations, acting as an escalation point for Level 1 SREs during major incidents.
Automation & Infrastructure as Code (IaC):
Develop and maintain automation scripts and infrastructure using tools like Terraform, Ansible, or CloudFormation.
Implement automation solutions to eliminate manual tasks and improve system reliability, scalability, and performance.
Performance & Scalability:
Analyze system performance and recommend optimizations for scalability and reliability.
Support capacity planning efforts by monitoring system metrics, traffic patterns, and usage trends to predict future resource needs.
System Design & Architecture:
Collaborate with software engineering teams to influence the design of new services and applications, ensuring they are scalable, reliable, and resilient from the start.
Contribute to architectural decisions, ensuring alignment with best practices in fault tolerance, redundancy, and recovery.
Monitoring & Observability:
Build and maintain robust monitoring, alerting, and observability solutions to proactively detect and resolve issues before they impact end users.
Optimize existing monitoring tools (e.g., Prometheus, Grafana, Datadog, Dynatrace) and build custom dashboards for better visibility into system health.
Security & Compliance:
Ensure systems and infrastructure are secure, compliant, and aligned with organizational policies and industry best practices.
Assist with vulnerability management, system patching, and implementing security measures to protect the integrity and availability of services.
Continuous Improvement:
Lead efforts to continuously improve operational processes, tools, and workflows.
Implement and enforce best practices in deployment, monitoring, and incident management to improve overall system reliability and reduce downtime.

Basic Qualifications

Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent work experience.
3 years of experience in site reliability engineering, application monitoring, systems administration, or related roles.
Proven track record of managing complex infrastructure, troubleshooting production issues, and optimizing system performance.

Preferred Qualifications

Strong experience with Linux/Unix administration and proficiency in scripting (e.g., Python, Bash, Go).
5 years of experience in site reliability engineering, DevOps, systems administration, or related roles.
Experience with containerization and orchestration technologies like Docker and Kubernetes.
Familiarity with distributed systems and microservices architecture.
Excellent problem-solving and troubleshooting skills, especially in diagnosing production issues in high-scale environments.
Microsoft Office experience.
Experience working in a multi-platform environment.
Ability to balance both development and support roles.
Experience working on projects that involve business segments.
Strong analytical and troubleshooting skills and excellent communication skills.
Strong interpersonal skills, a focus on customer service, and the ability to work well with other IT, vendor, and business groups.
