Description
There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems.
As a Site Reliability Engineer III at JPMorgan Chase within the AIML Data Platforms and Chief Data and Analytics Team, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform.
Job responsibilities
- Assists in operating and maintaining the managed AWS and Data platforms; provides day-to-day engineering and operational support to SRE and application teams under guidance.
- Supports platform design, setup, and configuration; performs workspace administration, resource monitoring, and basic troubleshooting for data engineering, Data Science/ML, and application/integration teams.
- Participates in evaluation activities with external vendors, startups, and internal teams; documents findings and recommendations for senior review.
- Contributes to improvements in system observability, alerting, and capacity planning by building dashboards, updating runbooks, and implementing basic automation.
- Collaborates with engineering and data teams to optimize infrastructure and deployment processes, focusing on automation and operational excellence; writes and maintains scripts or pipelines following standards.
- Implements and troubleshoots software solutions; contributes to design and development tasks and escalates complex issues appropriately.
- Writes secure, high-quality production code for features and fixes; performs basic peer reviews and debugs own code when needed.
- Identifies recurring issues and proposes or implements automation and remediation steps to improve operational stability of applications and systems.
- Contributes to a team culture of inclusion, respect, and continuous learning.
- Applies Site Reliability Engineering best practices (e.g., SLIs/SLOs, error budgets, incident response) with direction from senior engineers to support reliability, scalability, and performance of data platforms.
- Participates in incident response following established procedures; assists with root-cause analysis, postmortem documentation, and implementation of corrective actions.
Required qualifications, capabilities, and skills
- Formal training or certification on software engineering concepts and applied experience
- Proficiency in site reliability culture and principles, and familiarity with how to implement site reliability within an application or platform
- Experience in observability, such as white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
- Understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management.
- Experience with monitoring tools, automation frameworks, and CI/CD pipelines.
- Experience writing Python applications or scripts and using automated unit testing frameworks.
- Experience with Terraform development and understanding of Terraform Enterprise.
- Experience contributing to system design discussions, application development, testing, and supporting operational stability.
- Familiarity with big data distributed compute frameworks such as Apache Spark, AWS Glue, and MapReduce.
- Strong troubleshooting, analytical, and communication skills.
Preferred qualifications, capabilities, and skills
- Familiarity with distributed systems and large-scale data processing.
- Experience with AWS and Python
- Knowledge of containerization (Docker, Kubernetes) and orchestration.
Required Experience:
IC