# 2017-11-30 4:23PM PST, OOM (Out of Memory) Proxy

## Summary

Tim and Chris noticed at 16:59 PST that mybinder.org showed an error saying that creating temporary users failed when launching a new server. This was noticed because Tim happened to look at grafana. One node running a proxy had run out of memory, which broke the proxy. Most likely a build container had used up all the memory on the node, as builds had no memory limits enforced.

## Timeline

### 2017-11-30 16:22 PST

Node runs out of memory, service starts degrading.

### 16:59 PST

Investigation starts after noticing that grafana looks weird. Error message when launching a new server:

```
Found built image, launching...
Launching server...
Failed to create temporary user for gcr.io/binder-prod/r2d-fd74043wildtreetech-explore-open-data:6be6d6329de99168d0a0b68f9d52228ecc61ac50
```

Confirmed by Tim and Chris.

### 17:02 PST

Suggestion that it might be related to https://status.cloud.google.com/incident/storage/17006. All nodes show status "Ready" on the "Kubernetes clusters" page of the cloud console.

### 17:04 PST

Chris reports that the proxy pods are in CrashLoopBackOff state and had been restarted ~13 times. A `kubectl describe` on the pod gave:

```
Normal   Started     42m                 kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Started container
Normal   Created     42m (x3 over 42m)   kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Created container
Normal   Started     42m (x3 over 42m)   kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Started container
Normal   Pulled      42m (x3 over 42m)   kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Container image "gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.15" already present on machine
Warning  BackOff     7m (x163 over 42m)  kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Back-off restarting failed container
Warning  FailedSync  2m (x187 over 42m)  kubelet, gke-prod-a-default-pool-6ab3d8ec-jsv7  Error syncing pod
```

and

```
Normal   Created     10s (x3 over 41s)  kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Created container
Normal   Started     10s (x3 over 41s)  kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Started container
Normal   Pulled      10s (x3 over 41s)  kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Container image "gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.15" already present on machine
Warning  Unhealthy   9s                 kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Readiness probe failed: Get http://10.12.5.36:10254/healthz: dial tcp 10.12.5.36:10254: getsockopt: connection refused
Warning  BackOff     5s (x5 over 33s)   kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Back-off restarting failed container
Warning  FailedSync  5s (x5 over 33s)   kubelet, gke-prod-a-default-pool-6ab3d8ec-0h28  Error syncing pod
```

and `kubectl logs` on the pod gave: `Error from server (BadRequest): a container name must be specified for pod proxy-fd9c4f59b-pjhth, choose one of: [nginx chp]` (see the note after the 17:12 entry on selecting a container).

### 17:07 PST

Tim asks if "some of them are still running". Chris says that pods are running but no new ones are being created.

### 17:09 PST

The jupyterhub pod's logs show lots of DELETE requests that are timing out, along with tracebacks. Tim suggests restarting the jupyterhub pod.

### 17:12 PST

Tim asks how to contact Yuvi. Tim restarts the jupyterhub pod with `kubectl get pod hub-deployment-944c6474c-22sjh -n prod -o yaml | kubectl replace --force -f -`. This restarts JupyterHub and removes all user pods; the error persists.
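Side note on the `kubectl logs` error seen at 17:04: the proxy pod runs two containers (`nginx` and `chp`, i.e. the configurable-http-proxy), so a container has to be named explicitly with `-c`. A minimal sketch, using the pod name from the error message above:

```
# Tail the configurable-http-proxy container's logs
kubectl logs proxy-fd9c4f59b-pjhth -n prod -c chp

# Tail the nginx container's logs
kubectl logs proxy-fd9c4f59b-pjhth -n prod -c nginx
```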
### 17:14 PST

Chris runs `kubectl describe` on the node on which the proxy runs, which gives:

```
Warning  OOMKilling  58m  kernel-monitor, gke-prod-a-default-pool-6ab3d8ec-0h28  Memory cgroup out of memory: Kill process 8339 (python) score 1576 or sacrific
```

### 17:18 PST

Chris suggests trying to cordon the OOM node and then delete the proxy pod (a command sketch is included at the end of this report).

### 17:20 PST

Chris notices that the proxy pod is working again. Service appears to be restored. The proxy pod had already been running for 10 minutes by this point (since ~17:10). Chris uncordons the node again.

### 17:21 PST

Service is restored, user pods start running again.

## Action items

### Repo2docker

Limit the memory available to build pods (https://github.com/jupyterhub/repo2docker/pull/159).

### Jupyterhub

Fix the configuration so that user pods are not cleaned up when the jupyterhub pod restarts (https://github.com/jupyterhub/mybinder.org-deploy/issues/198).

### Process

1. We should nominate one person to be the driver, possibly using a shared terminal, so that only one person is issuing commands that change the state of the cluster and the other person knows about them.
2. Better alerting for when something is wrong with the cluster. This incident could have been detected by counting 4xx responses on the jupyterhub pod.

## Logs

Trying to gather the logs looked at during the incident:

- https://hackmd.io/JwBgjAhgTARgHAZgLRjlMSAsBjCB2JGAEzwiTgFYwKBTNANnoDNwg===?edit (bhub pod log starting from before the problem was noticed)
- https://console.cloud.google.com/logs/viewer?project=binder-prod&minLogLevel=0&expandAll=false&timestamp=2017-11-30T23%3A22%3A18.000000000Z&resource=container%2Fcluster_name%2Fprod-a%2Fnamespace_id%2Fprod&advancedFilter=resource.type%3D%22container%22%0Aresource.labels.pod_id%3D%22prod-nginx-ingress-controller-7b4fdbdcc8-qp6jk%22%0Aresource.labels.zone%3D%22us-central1-a%22%0Aresource.labels.project_id%3D%22binder-prod%22%0Aresource.labels.cluster_name%3D%22prod-a%22%0Aresource.labels.container_name%3D%22nginx-ingress-controller%22%0Aresource.labels.namespace_id%3D%22prod%22%0Aresource.labels.instance_id%3D%224874803091664833146%22%0Atimestamp%3D%222017-11-30T23%3A22%3A18.000000000Z%22%0AinsertId%3D%221xxo64cg1cnzk5b%22 (weird things start showing up in the log around here)
- gitter chat log: https://gitter.im/jupyterhub/binder?at=5a209b75ffa3e37919df1c15
- grafana screenshots: https://github.com/jupyterhub/binder/issues/39
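For future reference, a minimal sketch of the cordon-and-delete remediation suggested at 17:18, using the node and pod names that appear earlier in this report (the exact names would differ in a future incident):

```
# Cordon the node that hit the OOM so nothing new is scheduled on it
kubectl cordon gke-prod-a-default-pool-6ab3d8ec-0h28

# Delete the broken proxy pod; its deployment recreates it on a healthy node
kubectl delete pod proxy-fd9c4f59b-pjhth -n prod

# Once the node looks healthy again, allow scheduling on it
kubectl uncordon gke-prod-a-default-pool-6ab3d8ec-0h28
```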