Site Reliability Engineer
Location: Malvern, PA (Hybrid Schedule)
Job Description
As a Senior Reliability Engineer, you will play a critical role in solving impactful operational problems. You are curious and take a proactive approach to identifying problems and making improvements. You balance innovative thinking with pragmatism and understand the long-term impacts of technical decisions. You communicate complex ideas clearly and collaborate effectively to deliver scalable solutions.
Core Responsibilities
We are seeking an individual with a blend of software engineering and operational experience. The team focuses on a specialized subset of Site Reliability Engineering, particularly observability engineering. Its work involves integrating telemetry into application platforms within AWS environments (e.g., ECS, Lambda), supporting languages such as Python, Java, and JavaScript. Ideally, candidates should have strong JavaScript expertise across both client and server sides, along with a solid understanding of debugging, event loops, and memory management. While experience with telemetry is preferred, it is not a strict requirement.
* Improve resiliency engineering practices across platforms and applications, including resilient application design patterns, system observability, and deployment strategies.
* Detect, troubleshoot, and resolve incidents.
* Develop automation for incident response and infrastructure management.
* Develop and support OpenTelemetry integrations for multiple application platforms (browser, ECS, Lambda, etc.) and languages (JavaScript, Java).
* Contribute to architectural decisions and support implementation of solutions.
Skills and Qualifications
* Deep knowledge of Java or JavaScript. Practical experience developing and operating software in distributed systems environments.
* Problem-solving and analytical thinking: ability to diagnose complex issues and propose efficient solutions. Strong debugging and optimization skills for performance and scalability.
* Cloud platforms: hands-on experience with AWS services and cloud infrastructure.
* System architecture and design: ability to design scalable, secure, and maintainable systems.
* Working knowledge of Python (or similar scripting language).
* Strong knowledge of resiliency engineering techniques for both platforms and applications.
* Experience troubleshooting complex production issues and implementing effective mitigations.
* Familiarity with the OpenTelemetry specification and core APIs.
From a screening perspective, we recommend focusing on:
- How candidates approach software releases and validate functionality
- Their understanding of system dependencies and fault tolerance
- Experience with diagnosing and resolving production issues
- Their ability to reflect on past incidents and identify improvements
- Evidence of systems thinking and architectural awareness