2017-11-30 4:23PM PST, OOM (Out of Memory) Proxy#

Summary#

Tim and Chris noticed at 0:59 CET (need to convert to PST) that mybinder.org showed an error that creating temporary users failed when launching a new server. This was noticed because Tim happened to look at grafana. One node running a proxy had run out of memory which broke the proxy. Probably a build container had used up all the memory on the node as builds had no memory limits enforced.

Timeline#

2017-11-30 16:22 PST#

Node runs out of memory, service starts degrading.

16:59 PST#

Investigation starts after noticing grafana looks weird. Error message when launching a new server:

Found built image, launching...
Launching server...
Failed to create temporary user for gcr.io/binder-prod/r2d-fd74043wildtreetech-explore-open-data
:6be6d6329de99168d0a0b68f9d52228ecc61ac50

Confirmed by Tim and Chris.

17:02 PST#

Suggestion that it might be related to https://status.cloud.google.com/incident/storage/17006. All nodes show status “Ready” on “Kubernetes clusters” page of cloud console.

17:04 PST#

Chris reports that proxy pods are in CrashloopBackoff state and had been restarted ~13 times. A kubectl describe on the pod gave:

Normal   Started                42m                 kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Started container
  Normal   Created                42m (x3 over 42m)   kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Created container
  Normal   Started                42m (x3 over 42m)   kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Started container
  Normal   Pulled                 42m (x3 over 42m)   kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Container image "gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.15" already present on machine
  Warning  BackOff                7m (x163 over 42m)  kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Back-off restarting failed container
  Warning  FailedSync             2m (x187 over 42m)  kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Error syncing pod

and

 Normal   Created                10s (x3 over 41s)  kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Created container
  Normal   Started                10s (x3 over 41s)  kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Started container
  Normal   Pulled                 10s (x3 over 41s)  kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Container image "gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.15" already present on machine
  Warning  Unhealthy              9s                 kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Readiness probe failed: Get http://10.12.5.36:10254/healthz: dial tcp 10.12.5.36:10254: getsockopt: connection refused
  Warning  BackOff                5s (x5 over 33s)   kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Back-off restarting failed container
  Warning  FailedSync             5s (x5 over 33s)   kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Error syncing pod

and logs gave:

Error from server (BadRequest): a container name must be specified for pod proxy-fd9c4f59b-pjhth, choose one of: [nginx chp]

17:07#

Tim asks if “some of them are still running?”. Chris says that pods are running but no new ones are being created.

17:09#

jupyterhub pod shows lots of DELETE requests that are timing out and tracebacks. Tim suggests restarting jupyterhub pod.

17:12#

Tim asks how to contact Yuvi. Tim restarts jhub pod with kubectl get pod hub-deployment-944c6474c-22sjh -n prod -o yaml | kubectl replace --force -f - This restarts the jupyterhub, removes all user pods, error persists.

17:14#

Chris runs kubectl describe on the node on which the proxy runs which gives:

Warning OOMKilling 58m kernel-monitor, gke-prod-a-default-pool-6ab3d8ec-0h28 Memory cgroup out of memory: Kill process 8339 (python) score 1576 or sacrific

17:18#

Chris suggests to try cordoning the OOM node and then deleting the proxy pod.

17:20#

Chris notices that proxy pod is working again. Service appears to be restored. proxy pod had been running for 10m already by this point (since ~15:10). Chris uncordons node again.

17:21 PST#

Service is restored, user pods start running again.

Action items#

Repo2docker#

Limit memory available to build pods. (https://github.com/jupyterhub/repo2docker/pull/159)

Jupyterhub#

Fix configuration so that user pods are not cleaned up when jupyterhub pod restarts (https://github.com/jupyterhub/mybinder.org-deploy/issues/198)

Process#

Should nominate one person to be the driver, maybe a shared terminal. So that only one person is issuing commands to change the state of the cluster and the other person knows about them.
Better alerting for when something is wrong with the cluster. This one could have been alerted by counting 4xx responses on the jupyterhub pod.

Logs#

Trying to collect to gather logs looked at during the incident:

https://hackmd.io/JwBgjAhgTARgHAZgLRjlMSAsBjCB2JGAEzwiTgFYwKBTNANnoDNwg===?edit (bhub pod log starting from before the problem was noticed)
https://console.cloud.google.com/logs/viewer?project=binder-prod&minLogLevel=0&expandAll=false&timestamp=2017-11-30T23%3A22%3A18.000000000Z&resource=container%2Fcluster_name%2Fprod-a%2Fnamespace_id%2Fprod&advancedFilter=resource.type%3D%22container%22%0Aresource.labels.pod_id%3D%22prod-nginx-ingress-controller-7b4fdbdcc8-qp6jk%22%0Aresource.labels.zone%3D%22us-central1-a%22%0Aresource.labels.project_id%3D%22binder-prod%22%0Aresource.labels.cluster_name%3D%22prod-a%22%0Aresource.labels.container_name%3D%22nginx-ingress-controller%22%0Aresource.labels.namespace_id%3D%22prod%22%0Aresource.labels.instance_id%3D%224874803091664833146%22%0Atimestamp%3D%222017-11-30T23%3A22%3A18.000000000Z%22%0AinsertId%3D%221xxo64cg1cnzk5b%22 weird things start showing up in the log around here
gitter chat log https://gitter.im/jupyterhub/binder?at=5a209b75ffa3e37919df1c15
grafana screenshots https://github.com/jupyterhub/binder/issues/39