2018-02-20, JupyterLab Announcement swamps Binder#

Summary#

The JupyterLab announcement post drove a great deal of traffic to mybinder.org. This caused several outages throughout the day from varying causes.

We prepared for this by temporarily increasing the minimum number of nodes.

After a deployment, most users were getting “Failed to create temporary user for gcr.io/binder-prod/” when attempting to launch their image. This was caused by a small bug that manifests only when large numbers of users are using Binder at the same time. The bug was identified and fixed, but due to logistical issues mybinder.org was unusable for about 1h50m and unstable for roughly a day.

Timeline#

All times in CET (GMT+1)

Feb 20 2018 14:17#

The JupyterLab announcement blog post goes live via Medium and Twitter. At this point, the per-repo limit is 300.

14:55#

Autoscaling increases the node count from 3 to 4 as intended. This results in a slight increase in the launch backlog while the new node is prepared (and the JupyterLab image is pulled).

15:05#

Autoscaling increases node count to 5. JupyterLab is very popular! Grafana shows that JupyterLab is heading very quickly for the 300 limit. It is decided to raise the per-repo limit to 500.

15:08#

JupyterLab hits the rate limit of 300 and Grafana starts listing failed builds due to the per-repo limit. The rate limit is behaving as intended.

[Action Item] Launches that are rejected due to rate limiting are registered as a ‘failed launch’ in our launch success metric. Rate-limited launches should instead get their own label.
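
For illustration, a hedged sketch of what separating these outcomes could look like with prometheus_client; the metric name, label values, and record_launch helper are hypothetical, not BinderHub’s actual instrumentation.

from prometheus_client import Counter

# Hypothetical counter: one label distinguishes quota rejections from real failures.
LAUNCH_COUNT = Counter(
    'binderhub_launch_count',
    'Launch attempts by outcome',
    ['status'],
)

def record_launch(outcome):
    """Record a launch outcome: 'success', 'failure', or 'quota-rejected'."""
    LAUNCH_COUNT.labels(status=outcome).inc()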

15:10#

PR #428 increases the per-repo limit to 500.
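
For reference, a hedged sketch of the kind of setting such a PR raises; the actual change lives in the mybinder.org-deploy helm values, and per_repo_quota here is assumed from BinderHub’s traitlet-based configuration.

# binderhub_config.py style sketch; not the actual contents of PR #428
c = get_config()  # noqa: get_config is provided by the traitlets config loader

# Allow up to 500 simultaneous launches of a single repository.
c.BinderHub.per_repo_quota = 500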

15:15#

Deployment of PR #428 to production fails on Travis due to a timeout waiting for helm upgrade on prod: the binder pod never becomes available.

The Travis deployment is restarted.

15:20#

The Travis deployment fails again, this time hanging during the Grafana annotation step. It is discovered that the Grafana pod is unhealthy:

$ kubectl describe pod grafana...
Events:
  Type     Reason          Age              From                                           Message
  ----     ------          ----             ----                                           -------
  Warning  Unhealthy       2m               kubelet, gke-prod-a-ssd-pool-32-134a959a-vvsw  Readiness probe failed: Get http://10.12.8.137:3000/login: dial tcp 10.12.8.137:3000: getsockopt: connection refused
  Warning  FailedSync      2m               kubelet, gke-prod-a-ssd-pool-32-134a959a-vvsw  Error syncing pod
  Warning  Failed          2m               kubelet, gke-prod-a-ssd-pool-32-134a959a-vvsw  Error: failed to start container "grafana": Error response from daemon: cannot join network of a non running container: 8babe89dbb28ea4c09f5490797b8bb2bd4e6298a8d79a04b8653febed86fec19
  Warning  Unhealthy       1m               kubelet, gke-prod-a-ssd-pool-32-134a959a-vvsw  Readiness probe failed: Get http://10.12.8.137:3000/login: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Normal   SandboxChanged  1m               kubelet, gke-prod-a-ssd-pool-32-134a959a-vvsw  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling         1m (x4 over 4h)  kubelet, gke-prod-a-ssd-pool-32-134a959a-vvsw  pulling image "grafana/grafana:4.6.3"
  Normal   Pulled          1m (x4 over 4h)  kubelet, gke-prod-a-ssd-pool-32-134a959a-vvsw  Successfully pulled image "grafana/grafana:4.6.3"
  Normal   Started         1m (x3 over 4h)  kubelet, gke-prod-a-ssd-pool-32-134a959a-vvsw  Started container
  Normal   Created         1m (x4 over 4h)  kubelet, gke-prod-a-ssd-pool-32-134a959a-vvsw  Created container

The Grafana pod is deleted, as is the binder pod that failed to start.

15:32#

Travis is retried once more, and succeeds this time.

Launch success begins to climb as JupyterLab pods rise from 300 to the new limit of 500.

The Grafana pod dies and is restarted multiple times, this time by Kubernetes without manual intervention.

16:00#

JupyterLab has hit the 500 user limit and is ????? After returning to working order, ????

16:05#

Launch success rate is at 0%. Something is wrong beyond load.

16:18#

The Hub is restarted via kubectl delete pod to attempt to clear out bad state.

The hub comes back and promptly culls many inactive pods. However, there appears to be a problem in the culler itself: every cull request fails with a 400 because it asks an already-stopping server to stop again, and the failure causes the culler to exit. The culler shouldn’t exit when a request fails.
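
A minimal sketch of the behavior we want instead, assuming the standard JupyterHub REST API for stopping servers; this is not the actual cull_idle_servers.py (which is tornado-based), and the helper names and environment variable defaults are illustrative.

import os
import requests

HUB_API = os.environ.get('JUPYTERHUB_API_URL', 'http://hub:8081/hub/api')
HEADERS = {'Authorization': 'token ' + os.environ['JUPYTERHUB_API_TOKEN']}

def cull_user_server(name):
    """Stop one user's server, tolerating servers that are already stopping."""
    resp = requests.delete(f'{HUB_API}/users/{name}/server', headers=HEADERS)
    if resp.status_code == 400:
        # Already stopping/stopped: log and move on instead of raising.
        print(f'Skipping {name}: {resp.text}')
        return
    resp.raise_for_status()

def cull_idle(idle_users):
    for name in idle_users:
        try:
            cull_user_server(name)
        except requests.HTTPError as err:
            # Any other failure is logged; it must not terminate the whole cull run.
            print(f'Failed to cull {name}: {err}')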

16:29#

Both the hub and proxy pods are restarted.

Everything’s still failing. Binder requests to the Hub are failing with a timeout:

[E 180220 15:32:35 launcher:93] Error creating user jupyterlab-jupyterlab-demo-gjvedw6o: HTTP 599: Timeout while connecting

[W 180220 15:32:35 web:1588] 500 GET /build/gh/jupyterlab/jupyterlab-demo/18a9793b58ba86660b5ab964e1aeaf7324d667c8 (10.12.8.27): Failed to create temporary user for gcr.io/binder-prod/r2d-fd74043jupyterlab-jupyterlab-demo:18a9793b58ba86660b5ab964e1aeaf7324d667c8

16:33#

BinderHub is restarted, in case there is an issue in BinderHub itself.

After this restart, launches begin to succeed again. It appears that BinderHub was unable to talk to JupyterHub. It could be that the tornado connection pool was draining (this has happened before due to a tornado bug).

It could also have been a Kubernetes networking issue in which pod networking stopped working.

16:40#

The Grafana pod restarted itself again. There is no indication as to why, but it could simply be getting reassigned to new nodes as the cluster resizes.

In hindsight, it is most likely because the pod only requests 100Mi of RAM and sets nothing more, leaving it unprotected when resources get scarce.

16:55#

Launches begin failing again with the same 599: Timeout errors.

17:00-18:00#

BinderHub keeps reaching timeouts once several requests to the Hub have accumulated.

18:03#

  • Cull JupyterLab pods older than 2 hours (103 pods).

  • Install pycurl on BinderHub, which has been known to fix some timeout issues on JupyterHub under load (see the sketch after this list).

  • Revert the per-repo limit back to 300 pods.
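
The pycurl change boils down to configuring tornado’s HTTP client; a hedged sketch of the idea (BinderHub’s actual wiring may differ, and max_clients here is an illustrative value):

from tornado.httpclient import AsyncHTTPClient

# Requires pycurl in the image; the curl-based client copes better with many
# concurrent requests than the default SimpleAsyncHTTPClient.
AsyncHTTPClient.configure(
    'tornado.curl_httpclient.CurlAsyncHTTPClient',
    max_clients=64,
)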

18:10#

Travis deployment fails tests on prod:

def test_hub_up(hub_url):
        """
        JupyterHub url is up and returning sensible result (403)
        """
        resp = requests.get(hub_url)
        # 403 is expected since we are using nullauthenticator
        # FIXME: Have a dedicated health check endpoint for the hub
>       assert resp.status_code == 403
E       assert 504 == 403
E        +  where 504 = <Response [504]>.status_code

This failure is due to a networking problem on a node, but we don’t know that yet.

(Note: I suspect this is due to JupyterHub being too busy, not because of node failure)

18:22#

The pycurl PR is reverted due to suspicion that it caused Service Unavailable errors. It turns out this is not the case; the Hub really was unavailable due to bad networking state on at least one node.

Hub logs show:

[E 2018-02-20 17:18:21.889 JupyterHub app:1623]
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1620, in launch_instance_async
        yield self.initialize(argv)
      File "/usr/lib/python3.6/types.py", line 184, in throw
        return self.__wrapped.throw(tp, *rest)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1382, in initialize
        yield self.init_spawners()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1210, in init_spawners
        self.users[orm_user.id] = user = User(orm_user, self.tornado_settings)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 178, in __init__
        self.spawners[name] = self._new_spawner(name)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 208, in _new_spawner
        spawner = spawner_class(**spawn_kwargs)
      File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 83, in __init__
        self.api = client.CoreV1Api()
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/core_v1_api.py", line 35, in __init__
        api_client = ApiClient()
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 67, in __init__
        self.pool = ThreadPool()
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 789, in __init__
        Pool.__init__(self, processes, initializer, initargs)
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 174, in __init__
        self._repopulate_pool()
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool
        w.start()
      File "/usr/lib/python3.6/multiprocessing/dummy/__init__.py", line 48, in start
        threading.Thread.start(self)
      File "/usr/lib/python3.6/threading.py", line 846, in start
        _start_new_thread(self._bootstrap, ())
    RuntimeError: can't start new thread

The node is cordoned, as it is suspected to be the culprit. The hub pod is deleted so that it gets reassigned to a new node.

18:30#

To help recovery, user pods older than one hour are deleted.

18:37#

Nothing is responding to requests anymore, including Prometheus, Grafana, the hub, and binder. The ingress controller pods are deleted to try to help. It helps!

The previously cordoned node is drained, as it is suspected of causing the widespread outage.

18:40#

Launch success rate is back to 100% after ~1hr of total downtime and ~4 hours of diminished capacity.

19:00#

JupyterLab pods once again reach 300 user limit. Things seem to behave as intended at this point.

19:33#

The repo quota is bumped to 303 in order to see if we can handle deployments under heavy load.

19:35#

We notice that the launch success rate now begins falling sharply.

19:40#

Logs show many “Failed to create temporary user for …” errors.

19:42#

The binder- logs also show Error creating user jupyterlab-jupyterlab-demo-7qptz8ws: HTTP 599: Connection timed out. We also note that the Grafana launch percentile plot has stopped reporting.

19:45#

Delete the Binder pod: kubectl delete pod --namespace=prod binder-66fcc59fb9-58btx

19:52#

The Grafana launch percentile plot is back and the launch success rate is back up. The binder- pod is stable.

2X:XX (not sure exact time)#

The same connection timed out errors pop up as before; the launch success rate is down to zero again.

21:57#

We realize that the hub- pod is being overwhelmed by HTTP requests due to the high traffic and is locking up. This is what is causing the behavior.

22:12#

All pods are deleted.

22:30#

Pods are launching again. We are keeping an eye on the hub- pod CPU usage, which was very high during the spikes in traffic (~125%).

This is done with kubectl --namespace=prod exec -it hub-989cc9bd-5qdcb /bin/bash and then running top.

22:43#

CPU usage on the hub seems to have stabilized.

22:44#

Another realization: if user pods were deleted while a user’s session was still active, subsequent HTTP requests from that session would be sent to the “default” route, which pointed at the hub. This was overwhelming the hub even more. We should make “default” go to a 404 page rather than the hub. (The hub’s response codes show the switch from mostly 2xx to 3xx.)

22:52 (maybe earlier?)#

We have the idea to remove the default entry in the proxy routing table that points at the hub- pod, so that fewer HTTP requests reach it. This is done with the following commands.

  • First, enter the hub- pod and start a python session with:

    kubectl --namespace=prod exec -it hub-989cc9bd-5qdcb /bin/bash

    then

    python

    and import the modules used by the commands below:

    import os, requests

  • Delete the default HTTP route, so stray user traffic no longer falls through to the hub:

    requests.delete('http://proxy-api:8001/api/routes//', headers={'Authorization': 'token ' + os.environ['CONFIGPROXY_AUTH_TOKEN']})

  • Add a route for /hub/api so that API requests still reach the hub:

    requests.post('http://proxy-api:8001/api/routes//hub/api', headers={'Authorization': 'token ' + os.environ['CONFIGPROXY_AUTH_TOKEN']}, json={'hub': True, 'target': 'http://10.15.251.161:8081', 'jupyterhub': True, 'last_activity': '2018-02-20T21:18:29.579Z'})

We decide not to experiment on live users right now. Everything has been stable since the per-repo throttling was re-established.

Feb 21, 13:00#

A more permanent fix for the proxy routes is applied (PR #441): a route for /user/ is added to the configurable-http-proxy (CHP) routing table and handled by an nginx instance that redirects all requests for stopped pods back to the main deployment URL (the Binder landing page).
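
Conceptually this uses the same CHP REST API as the commands above; a hedged sketch of registering such a route by hand (the real fix lives in the chart and nginx config, and http://redirector stands in for the nginx service):

import os
import requests

headers = {'Authorization': 'token ' + os.environ['CONFIGPROXY_AUTH_TOKEN']}

# Route /user/... (requests for stopped pods) to the redirector instead of the hub.
requests.post(
    'http://proxy-api:8001/api/routes/user',
    headers=headers,
    json={'target': 'http://redirector'},
)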

15:15#

The /user/ route is updated to serve a 404 on most URLs instead of redirecting to BinderHub, to avoid shifting request load onto BinderHub.

16:20#

Many pods are failing to stop, causing an increase in failures. Nodes -qz3m and -vvsm are cordoned as likely culprits, since the VM logs attribute lots of errors to these instances.

Note: we need to find a better way to identify bad nodes.

Feb 22, 07:13#

It is discovered that the hub’s high CPU usage is due to a known bug in KubeSpawner with the Kubernetes Python client version 4.0, which was fixed in the jupyterhub helm chart v0.6. With kubernetes-client 4.0, each Spawner allocates four idle threads via a ThreadPool. After some time, the total thread count gets high enough (~7000) that CPU usage gets very high, even though the threads are idle.
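
A minimal sketch of the mechanism, assuming kubernetes-client 4.0 as in the traceback above (later client versions create the pool lazily, so the numbers will differ):

import threading
from kubernetes import client

api_clients = []
for i in range(1, 101):                     # stands in for "one Spawner per user"
    api_clients.append(client.CoreV1Api())  # each ApiClient() starts its own ThreadPool
    if i % 25 == 0:
        print(f'{i} clients -> {threading.active_count()} threads')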

07:18#

The jupyterhub chart version used by binderhub is updated to v0.6. Once deployed, CPU usage of the Hub returns close to zero. This high CPU usage in the Hub due to a flood of threads created by kubernetes-client is believed to be the root cause of most of our issues during this incident.

Conclusions#

The ultimate cause of this incident was a bug in a specific combination of JupyterHub, KubeSpawner, and kubernetes-client versions that causes unreasonably high CPU load under moderate user load. This bug had been fixed weeks earlier in the jupyterhub chart, but the fix was not yet deployed: BinderHub was using a development version of the jupyterhub chart from before the latest stable release.

  1. Deploying a capacity increase during heavy load may not be a recipe for success, but this is inconclusive.

  2. Handling of slow shutdown needs work in JupyterHub.

  3. There is a bug in JupyterHub causing it to attempt to delete proxy routes that are not there. Handling of the resulting 404 is already fixed in JupyterHub, but the bug causing the incorrect attempt is still undiagnosed.

  4. Grafana is regularly being restarted, which causes its page to be down. Since deployments now notify Grafana of each deploy, this can prevent a deploy from succeeding. It is a harmless failure in this case: if the Grafana annotation fails, no deploy stages are attempted, so a Travis retry is quite safe.

  5. The culler has an issue where it exits if a request fails with a 400.

  6. The culler shouldn’t be making requests that fail with a 400 in the first place.

  7. Deploying a change to prod under heavy load causes instability, in this case manifesting in new users not being created.

  8. It is unclear whether this instability was fixed by deleting the binder- pod, or whether it resolved once the change had time to propagate.

  9. JupyterHub was essentially getting swamped because it was handling an order of magnitude (or more) more HTTP requests than usual. This was because of a few factors:

    1. The aforementioned big bump in usage

    2. The “default” route points to the hub, so when a user’s pod was deleted and they kept using their session, all resulting requests went to the hub.

    3. We don’t have a mechanism for throttling requests on the hub

    4. We only have a single hub that’s handling all HTTP requests

    5. There were cascading effects going on where errors would generate more HTTP requests that would worsen the problem.

  10. Some issues may have been attributable to unhealthy nodes, but diagnosing unhealthy nodes is difficult.

Action Items#

JupyterHub#

  • Release JupyterHub 0.9 (or backport for 0.8.2), which has some known fixes for some of these bugs (https://github.com/jupyterhub/jupyterhub/issues/1676)

  • Improve handling of spawners that are slow to stop https://github.com/jupyterhub/jupyterhub/issues/1677

  • Investigate allowing deletion of users whose servers are slow to stop or fail to stop altogether https://github.com/jupyterhub/jupyterhub/issues/1677

  • Implement an API-only mode for use cases like Binder (https://github.com/jupyterhub/jupyterhub/issues/1675)

Zero-to-JupyterHub#

cull_idle_servers:

  • Identify the reason why 400 responses cause the script to exit (https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/522)

  • Avoid 400 responses by waiting for servers to stop before deleting users (https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/522)

BinderHub#

  • Ensure pycurl is used; it is known to perform better with large numbers of requests than tornado’s default SimpleAsyncHTTPClient (https://github.com/jupyterhub/binderhub/pull/460)

  • Investigate timeout issue, which may be due to lack of pycurl, too many concurrent requests, or purely the overloaded Hub (https://github.com/jupyterhub/binderhub/issues/464)

  • Separate the rejection code/metadata for launch failures due to the repo limit vs. “regular” launch failures. Note: on investigation, we already do this, so launch failures should not include rejected launches.

  • Figure out whether there’s a way to reduce the number of HTTP requests going to the JupyterHub (this became a problem with high load) (https://github.com/jupyterhub/binderhub/pull/461)

  • Make it possible for Binder to launch multiple JupyterHubs and direct users through those hubs in a round-robin fashion (https://github.com/jupyterhub/binderhub/issues/465)

Deployment#

  • Avoid sending requests for stopped pods to the Hub (which may overwhelm it if there’s high load) (https://github.com/jupyterhub/mybinder.org-deploy/pull/444)

  • Document ways to recognize and identify unhealthy nodes. At least some of the issues had to do with nodes that had become unhealthy, but diagnosing this was difficult. (https://github.com/jupyterhub/mybinder.org-deploy/issues/468)

  • Come up with group guidelines for deploying changes under heavy loads. (https://github.com/jupyterhub/mybinder.org-deploy/issues/466)

  • Investigate what the “expected” downtimes are for a change to repo user limits, or for other changes more broadly (https://github.com/jupyterhub/mybinder.org-deploy/issues/466)

  • Find a way to limit HTTP requests to the JupyterHub in cases of high load.

  • “we should also monitor and alert on jupyterhub process > 70% CPU” (monitoring done in grafana)

  • Move Grafana and other support services to an external cluster, so they are not affected by load in the main cluster. Our tools for debugging should not be affected by the bugs we are trying to debug (https://github.com/jupyterhub/mybinder.org-deploy/issues/438)

  • Document clear processes for requesting limit raises and how they should be granted https://github.com/jupyterhub/mybinder.org-deploy/issues/438

  • Fix cadvisor + prometheus setup so we properly get CPU / Memory statistics from cadvisor https://github.com/jupyterhub/mybinder.org-deploy/pull/449

  • Make all infrastructure pods run in the Guaranteed QoS class so they do not get restarted when resources get scarce.