2018-02-22 NGINX crash

Summary

We received Stackdriver email alerts saying Binder was down and confirmed that all Binder services, including mybinder.org, were down. The logs showed very little output in general, so we suspected the NGINX pods. We deleted the pods, and everything went back to normal.

Timeline

All times are in PST.

2018-02-22 16:30

Stackdriver sends emails about the outage, which we notice. We check Grafana and mybinder.org; both are down.

16:32

We check the logs generated by the hub. There are very few log entries in general, and the most recent ones show errors with the NGINX ingress pods.

16:33

We delete the ingress controller pods and wait for new ones to come up.
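For future reference, the restart step can be sketched as a small Python wrapper around `kubectl`. This is a minimal sketch, not a record of the exact commands run during the incident: the namespace and label selector are hypothetical placeholders (the report does not record them), and the helper function names are made up for illustration. It assumes `kubectl` is installed and pointed at the right cluster.

```python
import subprocess

# Hypothetical values: this report does not record the real namespace or pod labels.
NAMESPACE = "ingress-namespace"
SELECTOR = "app=nginx-ingress"


def delete_pods_cmd(namespace, selector):
    """Build the kubectl command that deletes every pod matching the selector."""
    return ["kubectl", "delete", "pods",
            "--namespace", namespace, "--selector", selector]


def list_pods_cmd(namespace, selector):
    """Build the kubectl command that lists the matching pods and their status."""
    return ["kubectl", "get", "pods",
            "--namespace", namespace, "--selector", selector]


def restart_ingress_pods(namespace=NAMESPACE, selector=SELECTOR, dry_run=True):
    """Delete the matching pods, then list their replacements.

    With dry_run=True the commands are only printed, not executed.
    """
    cmds = [delete_pods_cmd(namespace, selector),
            list_pods_cmd(namespace, selector)]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    # Prints the two commands without touching the cluster.
    restart_ingress_pods(dry_run=True)
```

Deleting the pods relies on their controller recreating them. Rerun the listing command until the new pods report `Running`.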

16:35

We confirm that the problem is resolved: Grafana and mybinder.org are back up.

Lessons learnt

What went well

The problem was identified and resolved very quickly.

What went wrong

The alert may have gone out to only a subset of team members; had it reached everyone, the outage could have been noticed earlier.

Where we got lucky

We happened to be in a position where we could quickly debug and fix.

Action items

Process improvements

Make sure that everybody gets emailed when the site goes down (DONE)

Technical improvements

Find a path forward for switching away from NGINX to a better traffic-handling technology (#528)