2018-02-22 NGINX crash#
We received Stackdriver email alerts saying Binder was down and confirmed that all Binder services were down (including mybinder.org). The logs showed very little output in general, so we suspected the NGINX pods. We deleted the pods, and everything went back to normal.
All times in PST
Stackdriver sends emails about the outage, which we notice. Check Grafana as well as mybinder.org, and both are down.
Check the logs being generated by the hub. Very few logs in general, and most recent ones show errors with the NGINX ingress pods.
Delete the ingress controller pods and wait for new ones to come up.
Confirm that the problem is resolved: Grafana and mybinder.org are both back up.
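The recovery steps in the timeline can be sketched as a small Python wrapper around `kubectl`. This is an illustration under assumptions, not a record of the commands run during the incident: the namespace and label selector are hypothetical placeholders, since this report does not name them. The script defaults to a dry run that only prints the commands.

```python
import subprocess

# Hypothetical values: this report does not name the namespace or pod labels.
NAMESPACE = "ingress-namespace"      # assumption, replace with the real namespace
SELECTOR = "app=nginx-ingress"       # assumption, replace with the real label selector


def recovery_commands(namespace=NAMESPACE, selector=SELECTOR):
    """Return the kubectl invocations for the recovery steps, in order."""
    target = ["-n", namespace, "-l", selector]
    return [
        # 1. List the ingress controller pods to see their current state.
        ["kubectl", "get", "pods", *target],
        # 2. Delete the pods; their controller is expected to recreate them.
        ["kubectl", "delete", "pods", *target],
        # 3. List the pods again to check that replacements have come up.
        ["kubectl", "get", "pods", *target],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(recovery_commands())  # dry run: prints the commands without executing them
```

Keeping the dry run as the default means the commands can be reviewed before anything is deleted; pass `dry_run=False` only once the namespace and selector have been checked against the real cluster.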
What went well#
The problem was identified and resolved very quickly.
What went wrong#
The alert may have gone out to only a subset of team members, so the outage could have been noticed earlier if everyone had been alerted.
Where we got lucky#
We happened to be in a position where we could quickly debug and fix.
Action items#
Make sure that everybody gets emailed when the site goes down (DONE)
Find a path forward to switch away from NGINX to better traffic tech #528