Systems Development Engineer, AWS Incident Response (AIR)
Job Summary
As a Systems Development Engineer on the AIR team, you will lead the response to critical customer-impacting events: triaging impact, identifying root causes, coordinating mitigation actions with service teams, and driving resolution in real time. Not every event is solved by automation; you will use your technical judgment to assess situations, engage the right teams, and direct mitigation strategies when manual intervention is required. Insights from these events directly inform the automation and tooling you build, creating a continuous improvement loop where each event makes the next one shorter or prevents it entirely.
This role offers a unique combination of systems development and real-time operational leadership, with direct impact on the availability of AWS services used by millions of customers.
Key job responsibilities
- Drive the resolution of large-scale customer-impacting incidents as part of an on-call rotation (including weekends and holidays), leading incident calls and coordinating resolver teams across AWS service organizations
- Design, build, and enhance incident detection, triage, and mitigation automation tools
- Author Correction of Error (COE) and event deep-dive documents to identify improvement opportunities; create and lead action items that improve processes, tooling, and automation
- Identify recurring platform issues and own projects that eliminate entire classes of operational problems
- Collaborate with teams globally to expand incident response capabilities across AWS regions and services
A day in the life
A Systems Development Engineer on the AWS Incident Response (AIR) team has full visibility into all AWS services. There are many opportunities to learn, as you will work with AWS internal teams and gain exposure to AWS products and services.
When on call, your day may start with a large-scale event: you join the conference bridge, assess the scope of impact using real-time dashboards, identify impaired services, engage the right teams, and drive mitigation until the event is resolved. After the event, you lead the deep-dive, document findings, and create action items to prevent recurrence.
When off call, you spend your time building and improving the tools that make incident response faster and more automated. You might be writing code to improve event detection logic, building dashboards that surface the right signals during triage, or working on automation that reduces manual steps during mitigation. You participate in design and code reviews and collaborate with engineers across AIR to drive operational improvements. You also invest time in learning AWS service architectures; understanding how services fail helps you respond faster when they do.
About the team
AWS Incident Response (AIR) is a globally distributed team responsible for leading the response to large-scale, customer-impacting events across AWS. We operate 24/7, providing incident leadership and coordination for events that span multiple services and regions. Our engineers combine hands-on incident leadership with systems development: we build the automation and tooling we use, and every event teaches us how to make the next one shorter or prevent it entirely. The team values operational excellence, continuous learning, and a bias for action. We work closely with service teams and with networking and infrastructure organizations across AWS, giving our engineers broad exposure to how AWS operates under the hood.
- Knowledge of systems engineering fundamentals (networking, storage, operating systems)
- Experience designing or architecting (design patterns, reliability, and scaling) new and existing systems
- Experience in networking, storage systems, operating systems, and hands-on systems engineering
- Experience programming with at least one modern language such as C, C#, Java, Python, Golang, PowerShell, or Ruby
- Experience in automating, deploying, and supporting large-scale infrastructure
- Experience in the deployment or development of automation or monitoring frameworks
- Experience that includes strong analytical skills, attention to detail, and effective communication abilities, or experience in managing and troubleshooting networks
- Experience leading high-severity incident conference calls and driving resolution across multiple stakeholder teams
- Experience authoring detailed event deep-dive documents and driving action items to closure with service teams
Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify, and build. Protecting your privacy and the security of your data is a longstanding top priority for Amazon. Please consult our Privacy Notice to learn more about how we collect, use, and transfer the personal data of our candidates.
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.
Required Experience:
IC
About Company
Free shipping on millions of items. Get the best of Shopping and Entertainment with Prime. Enjoy low prices and great deals on the largest selection of everyday essentials and other products, including fashion, home, beauty, electronics, Alexa Devices, sporting goods, toys, and automotive.