2017-11-30 4:23PM PST, OOM (Out of Memory) Proxy#
Summary#
Tim and Chris noticed at 0:59 CET (need to convert to PST) that mybinder.org showed an error that creating temporary users failed when launching a new server. This was noticed because Tim happened to look at grafana. One node running a proxy had run out of memory which broke the proxy. Probably a build container had used up all the memory on the node as builds had no memory limits enforced.
Timeline#
2017-11-30 16:22 PST#
Node runs out of memory, service starts degrading.
16:59 PST#
Investigation starts after noticing grafana looks weird. Error message when launching a new server:
Found built image, launching...
Launching server...
Failed to create temporary user for gcr.io/binder-prod/r2d-fd74043wildtreetech-explore-open-data
:6be6d6329de99168d0a0b68f9d52228ecc61ac50
Confirmed by Tim and Chris.
17:02 PST#
Suggestion that it might be related to https://status.cloud.google.com/incident/storage/17006. All nodes show status “Ready” on “Kubernetes clusters” page of cloud console.
17:04 PST#
Chris reports that proxy pods are in CrashloopBackoff state and had been restarted ~13 times. A kubectl describe on the pod gave:
Normal Started 42m kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7 Started container
Normal Created 42m (x3 over 42m) kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7 Created container
Normal Started 42m (x3 over 42m) kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7 Started container
Normal Pulled 42m (x3 over 42m) kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7 Container image "gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.15" already present on machine
Warning BackOff 7m (x163 over 42m) kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7 Back-off restarting failed container
Warning FailedSync 2m (x187 over 42m) kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7 Error syncing pod
and
Normal Created 10s (x3 over 41s) kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28 Created container
Normal Started 10s (x3 over 41s) kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28 Started container
Normal Pulled 10s (x3 over 41s) kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28 Container image "gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.15" already present on machine
Warning Unhealthy 9s kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28 Readiness probe failed: Get http://10.12.5.36:10254/healthz: dial tcp 10.12.5.36:10254: getsockopt: connection refused
Warning BackOff 5s (x5 over 33s) kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28 Back-off restarting failed container
Warning FailedSync 5s (x5 over 33s) kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28 Error syncing pod
and logs gave:
Error from server (BadRequest): a container name must be specified for pod proxy-fd9c4f59b-pjhth, choose one of: [nginx chp]
17:07#
Tim asks if “some of them are still running?”. Chris says that pods are running but no new ones are being created.
17:09#
jupyterhub pod shows lots of DELETE requests that are timing out and tracebacks. Tim suggests restarting jupyterhub pod.
17:12#
Tim asks how to contact Yuvi. Tim restarts jhub pod with kubectl get pod hub-deployment-944c6474c-22sjh -n prod -o yaml | kubectl replace --force -f -
This restarts the jupyterhub, removes all user pods, error persists.
17:14#
Chris runs kubectl describe
on the node on which the proxy runs which gives:
Warning OOMKilling 58m kernel-monitor, gke-prod-a-default-pool-6ab3d8ec-0h28 Memory cgroup out of memory: Kill process 8339 (python) score 1576 or sacrific
17:18#
Chris suggests to try cordoning the OOM node and then deleting the proxy pod.
17:20#
Chris notices that proxy pod is working again. Service appears to be restored. proxy pod had been running for 10m already by this point (since ~15:10). Chris uncordons node again.
17:21 PST#
Service is restored, user pods start running again.
Action items#
Repo2docker#
Limit memory available to build pods. (https://github.com/jupyterhub/repo2docker/pull/159)
Jupyterhub#
Fix configuration so that user pods are not cleaned up when jupyterhub pod restarts (https://github.com/jupyterhub/mybinder.org-deploy/issues/198)
Process#
Should nominate one person to be the driver, maybe a shared terminal. So that only one person is issuing commands to change the state of the cluster and the other person knows about them.
Better alerting for when something is wrong with the cluster. This one could have been alerted by counting 4xx responses on the jupyterhub pod.
Logs#
Trying to collect to gather logs looked at during the incident:
https://hackmd.io/JwBgjAhgTARgHAZgLRjlMSAsBjCB2JGAEzwiTgFYwKBTNANnoDNwg===?edit (bhub pod log starting from before the problem was noticed)
https://console.cloud.google.com/logs/viewer?project=binder-prod&minLogLevel=0&expandAll=false×tamp=2017-11-30T23%3A22%3A18.000000000Z&resource=container%2Fcluster_name%2Fprod-a%2Fnamespace_id%2Fprod&advancedFilter=resource.type%3D%22container%22%0Aresource.labels.pod_id%3D%22prod-nginx-ingress-controller-7b4fdbdcc8-qp6jk%22%0Aresource.labels.zone%3D%22us-central1-a%22%0Aresource.labels.project_id%3D%22binder-prod%22%0Aresource.labels.cluster_name%3D%22prod-a%22%0Aresource.labels.container_name%3D%22nginx-ingress-controller%22%0Aresource.labels.namespace_id%3D%22prod%22%0Aresource.labels.instance_id%3D%224874803091664833146%22%0Atimestamp%3D%222017-11-30T23%3A22%3A18.000000000Z%22%0AinsertId%3D%221xxo64cg1cnzk5b%22 weird things start showing up in the log around here
gitter chat log https://gitter.im/jupyterhub/binder?at=5a209b75ffa3e37919df1c15
grafana screenshots https://github.com/jupyterhub/binder/issues/39