2019-04-03, 30min outage during node pool upgrade


During a Kubernetes version upgrade, all nodes running our ingress controller pods were cordoned. This went unnoticed and caused 30 minutes of total outage.


All times in GMT+2

2019-04-03 12:50

Start of incident. The final two nodes in the old user node pool are cordoned.


Investigation starts after a user reported that mybinder.org was down.


The ingress controller pods are deleted and rescheduled on uncordoned nodes. Service resumes. Incident ends.

Lessons learnt

What went well


  1. Service was quickly restored once the outage was reported

What went wrong

Ideally these should result in concrete action items that have GitHub issues created for them and linked to under Action items.

  1. The outage went unnoticed for 30 minutes


Where we got lucky


  1. A user reported the outage on Gitter, and someone was around to see it and react to it

Action items

Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten.

Technical improvements

  1. Update SRE guide to include guidance for moving ingress controller pods

  2. Set up our ingress deployment to be robust against nodes being cordoned
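
As an illustration of the kind of check that could support both items, here is a minimal sketch that flags ingress controller pods sitting on cordoned nodes. It is not the team's actual tooling. It assumes the inputs are parsed JSON shaped like the output of `kubectl get nodes -o json` and `kubectl get pods -o json` (filtered to the ingress controller pods), and it relies only on the standard `spec.unschedulable` node field and `spec.nodeName` pod field. The function names and the sample node and pod names are hypothetical.

```python
from typing import List, Set


def cordoned_node_names(nodes: dict) -> Set[str]:
    """Return the names of nodes marked unschedulable (cordoned)."""
    return {
        node["metadata"]["name"]
        for node in nodes.get("items", [])
        if node.get("spec", {}).get("unschedulable", False)
    }


def pods_on_cordoned_nodes(pods: dict, nodes: dict) -> List[str]:
    """Return the names of pods scheduled on a cordoned node."""
    cordoned = cordoned_node_names(nodes)
    return [
        pod["metadata"]["name"]
        for pod in pods.get("items", [])
        # Pending pods have no nodeName; .get() yields None, which never matches.
        if pod.get("spec", {}).get("nodeName") in cordoned
    ]


def all_pods_on_cordoned_nodes(pods: dict, nodes: dict) -> bool:
    """True when there is at least one pod and every pod is on a cordoned node."""
    items = pods.get("items", [])
    return bool(items) and len(pods_on_cordoned_nodes(pods, nodes)) == len(items)


# Hypothetical sample data mirroring the incident: both ingress controller
# pods run on the last two (cordoned) nodes of the old pool.
NODES = {
    "items": [
        {"metadata": {"name": "old-pool-node-1"}, "spec": {"unschedulable": True}},
        {"metadata": {"name": "old-pool-node-2"}, "spec": {"unschedulable": True}},
        {"metadata": {"name": "new-pool-node-1"}, "spec": {}},
    ]
}
PODS = {
    "items": [
        {"metadata": {"name": "ingress-controller-a"}, "spec": {"nodeName": "old-pool-node-1"}},
        {"metadata": {"name": "ingress-controller-b"}, "spec": {"nodeName": "old-pool-node-2"}},
    ]
}

if __name__ == "__main__":
    if all_pods_on_cordoned_nodes(PODS, NODES):
        print("All ingress controller pods are on cordoned nodes:",
              pods_on_cordoned_nodes(PODS, NODES))
```

A check like this could run before the last nodes of an old pool are cordoned, or periodically as an alert, so that the pods are moved to uncordoned nodes before users notice a problem.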