Software Development Manager, AWS Incident Response

Amazon Dev Center U.S., Inc.
 Seattle, WA


Job summary

AWS Incident Response is at the heart of the high availability of Amazon Web Services. We make customer impacting events shorter and less frequent by providing large scale event and incident management. Our automated tooling quickly identifies the cause of an issue and helps mitigate its impact, and much of our engineer time is spent on projects to improve the tooling and automation. We also provide manual incident management for AWS and other Amazon groups, directing the resolution of an issue with service teams, and diving deep into those events to drive improvements to the tooling. It's an exciting time to join our team as we are rapidly growing and expanding our offerings.

As a Software Development Manager on the team you will manage automated tooling roadmaps and delivery for the detection and resolution of issues within AWS and Amazon infrastructure. You will also spend a portion of your time ensuring your team efficiently directs the resolution of high visibility incidents in conference calls and virtual teams. Using data learned from those incidents you will drive further improvements into our automation, tooling, and processes so that the next event is shorter or avoided entirely. You will coordinate across project teams to expand use of our tooling to additional areas across Amazon. If you're looking for a team with great growth potential and an opportunity to make a huge impact, this is the team to join.


Define and Deliver Business Priorities

You will be a key contributor and owner of the direction of the global AWS Incident Management team. You will own design, development, test, deployment and operational capabilities in AWS Incident Response. You will define, plan, track and deliver on strategic goals for the team, while ensuring that the team remains unblocked and focused during the regular development cadence.

Performance Management/Team Health

You will own all facets of performance and career management for the team. You'll conduct regular one-on-one meetings with all team members. You will provide both technical and ‘soft skill’ mentoring in order to maintain a well-rounded, world class organization. This includes project management, quality audits and coordination of training sessions with senior-level engineers as well as day-to-day oversight of the team.

Cross-Site, Cross-Team Coordination

You will be responsible for coordinating with your counterparts to ensure that a clear communication channel exists between AWS Operations teams. You will also work closely with Systems and Network product teams to create and maintain proper processes for monitoring and alarming on services. A portion of this process will include establishing both solid operational acceptance criteria and a concrete feedback loop for resolving deviations from that process.

Incident/Change Management

You will be the point of contact for inquiries regarding engagement processes and issues within the global Amazon platform during your team’s coverage. Responsibilities include delegation of emergent engagement issues to team members, driving initiatives regarding improvements to existing tools & processes and providing feedback on new practices & procedures in order to scale with the rapid expansion of the Amazon platform and customer base.

Recruiting and Hiring

You will take the lead in hiring quality personnel who not only fit the needs of the current organization but will allow the team to scale with platform and service growth. You will coordinate with Amazon and external recruiting staff to evaluate potential candidates, participate in initial phone screens and provide relevant guidance and feedback during on-site interview loops. You will also be responsible for ensuring that proper training takes place for all new hires.

Oncall Escalation

As a member of the management team, you will participate in an escalation oncall rotation for all related issues, including high-impact systems and network events. The manager is also expected to respond to critical issues regarding engagement and incident management on an as-needed basis.

Basic Qualifications

  • 7+ years of experience working directly within engineering teams
  • Experience partnering with product or program management teams
  • 3+ years of people management experience, managing engineers
  • 3+ years of experience architecting and designing new and existing systems (architecture, design patterns, reliability and scaling)

Candidates must have a high degree of organization and be very detail-oriented. Must be able to interact with and influence people at all levels. Must possess excellent written and verbal communication skills and be able to interact well with peers and customers. Must have the ability to contribute to and support long-term visions and direction regarding tier one and tier two systems and networking support initiatives at Amazon. Experience in building and managing a team of strong technical people, and prior ownership of the operation of a mission-critical support team is crucial to success. The successful candidate will have a proven track record of success in delivering complex projects, including coordinating and driving issues to resolution autonomously utilizing excellent project management skills.

Preferred Qualifications

  • 5+ years experience managing a team in Operations.
  • Strong understanding of basic operational best practices such as monitoring, alerting, deployment and change policies (ITIL a plus)
  • Experience running agile frameworks or other workflow methodologies in an Operations setting.
  • Experience dealing with customers during issue resolution and operating under pressure.
  • Ability to effectively operate and communicate efficiently under pressure
  • Routine communication of status to senior management
  • SLA definition and refinement
  • Goal-setting for reduction and elimination of customer facing defects
  • Participation in post-mortem analysis, including ensuring a high quality bar for analysis and follow through of consequent action items
  • Passion and aptitude for data analysis
  • Experience with quantitative measurement and improvement
  • Experience building services for a large scale cloud platform such as AWS.

Amazon is committed to a diverse and inclusive workplace. Amazon is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status. For individuals with disabilities who would like to request an accommodation, please visit