# 2018-02-22 NGINX crash

## Summary

We received Stackdriver email alerts saying Binder was down and confirmed that all Binder services (including mybinder.org) were unreachable. The logs showed very little output in general, which pointed at the NGINX ingress pods. We deleted those pods, and once they were recreated everything went back to normal.

## Timeline

All times in PST.

### 2018-02-22 16:30

Stackdriver sends emails about the outage, which we notice. Checked Grafana as well as `mybinder.org`; both are down.

### 16:32

Checked the logs being generated by the hub. Very few logs overall, and the most recent ones show errors from the NGINX ingress pods.

### 16:33

Deleted the ingress controller pods and waited for new ones to come up (see the command sketch at the end of this report).

### 16:35

Confirmed that the problem is resolved; Grafana and mybinder.org are back.

## Lessons learnt

### What went well

1. The problem was very quickly identified and resolved.

### What went wrong

1. The alert may have only gone out to a subset of team members; if everyone had been notified, the outage could have been noticed earlier.

### Where we got lucky

1. We happened to be in a position where we could quickly debug and fix.

## Action items

### Process improvements

1. Make sure that everybody gets emailed when the site goes down (DONE).

### Technical improvements

1. Find a path forward to switch away from NGINX to better traffic tech [#528](https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/528)
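
## Appendix: example recovery commands

A minimal sketch of the kind of `kubectl` commands used to diagnose and recover, assuming the pods live in a `binder` namespace and carry labels such as `component=hub` and `component=ingress-controller`; the actual namespace and labels in the deployment may differ.

```bash
# Gauge overall activity by looking at recent hub logs
# (namespace and label values are placeholders; adjust to the real deployment).
kubectl --namespace=binder logs --tail=100 -l component=hub

# Check the ingress controller pods and their recent logs
kubectl --namespace=binder get pods -l component=ingress-controller
kubectl --namespace=binder logs --tail=100 -l component=ingress-controller

# Delete the ingress controller pods; the owning Deployment/ReplicaSet
# recreates them automatically.
kubectl --namespace=binder delete pods -l component=ingress-controller

# Watch the replacement pods come up
kubectl --namespace=binder get pods -l component=ingress-controller --watch
```

Deleting the pods works as a fix here because the ingress controller is managed by a Deployment, so replacements are scheduled immediately and traffic resumes once they are ready.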