2019-04-03, 30min outage during node pool upgrade#
During a Kubernetes version upgrade, all nodes running our ingress-controller pods were cordoned. This went unnoticed and caused roughly 40 minutes of total outage.
Timeline#
All times in GMT+2.
- Start of incident. The final two nodes in the old user node pool are cordoned.
- Investigation starts after a user reports that mybinder.org is down.
- The ingress-controller pods are deleted and rescheduled on uncordoned nodes. Service resumes. Incident ends.
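The recovery step above can be sketched with `kubectl`. The namespace and label selector here are assumptions for illustration, not the actual values from our deployment:

```shell
# Spot cordoned nodes: they show SchedulingDisabled in STATUS.
kubectl get nodes

# Delete the ingress-controller pods so the scheduler recreates them
# on schedulable (uncordoned) nodes. Namespace and label are hypothetical.
kubectl -n ingress-nginx delete pod -l app.kubernetes.io/name=ingress-nginx

# Confirm the replacement pods landed on uncordoned nodes.
kubectl -n ingress-nginx get pods -o wide
```

Deleting the pods works because the Deployment's replica controller immediately recreates them, and the scheduler only considers nodes that are not cordoned.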
What went well#
List of things that went well. For example:
- Service was quickly restored once the outage was reported.
What went wrong#
Things that could have gone better. Ideally these should result in concrete action items, each with a GitHub issue created for it and linked to under Action items.
- The outage went unnoticed for 40 minutes.
Where we got lucky#
These are good things that happened to us, but not because we had planned for them. For example:
- A user reported the outage on Gitter, and someone was around to see it and react to it.
Action items#
These are only sample subheadings. Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten.
- Update the SRE guide to include guidance for moving ingress-controller pods.
- Set up our ingress deployment to be robust against nodes being cordoned.
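A minimal sketch of what "robust against cordoning" could look like: run at least two replicas, spread them across nodes with pod anti-affinity, and add a PodDisruptionBudget so drains during upgrades cannot evict every ingress pod at once. All names, labels, and the image tag below are illustrative assumptions, not our actual manifests:

```yaml
# Hypothetical manifests for illustration only.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingress-controller-pdb
spec:
  # Drains (voluntary evictions) must always leave one pod running.
  minAvailable: 1
  selector:
    matchLabels:
      app: ingress-controller
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-controller
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ingress-controller
  template:
    metadata:
      labels:
        app: ingress-controller
    spec:
      # Spread replicas across distinct nodes so cordoning and draining
      # a single node cannot take down all ingress pods at once.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: ingress-controller
              topologyKey: kubernetes.io/hostname
      containers:
        - name: controller
          image: registry.k8s.io/ingress-nginx/controller:v1.9.4  # illustrative tag
```

Note that cordoning alone only blocks new scheduling; the PodDisruptionBudget matters when nodes are subsequently drained, and the anti-affinity ensures the replicas are never co-located on the node being upgraded.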