2018-07-30 JupyterLab builds saturate BinderHub CPU#
Summary#
Binder wasn’t properly building pods and launches weren’t working. It was determined that:
- The jupyterlab-demo repo updated itself, triggering a build
- The update to jupyterlab-demo installed a newer version of JupyterLab
- repo2docker needed a loooong time to install this (perhaps because of webpack size issues)
- Since the repository gets a lot of traffic, each launch request made while the build is still happening eats up CPU in the Binder pod
- The Binder pod was thus getting saturated and behaving strangely, causing the outage
Banning the jupyterlab-demo repository resolved the CPU saturation issue.
Timeline#
All times in PDT (UTC-7)
2018-07-30 ca. 11:20#
We notice that launches are not happening as expected. Cluster utilization is very low, suggesting that pods aren’t being created.
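One way to confirm that user pods aren’t being created is to watch the production namespace directly; a minimal sketch, assuming the namespace is called prod and user pods are prefixed jupyter- (both are assumptions about this deployment):
kubectl get pods -n prod --watch    # new jupyter-... user pods should appear for each launch; none did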
11:22#
Notice an SSL protocol error:
tornado.curl_httpclient.CurlError: HTTP 599: Unknown SSL protocol error in connection to gcr.io:443
Binder pod is deleted and launches return to normal.
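For reference, this short-term fix amounts to deleting the pod and letting its Deployment recreate it; a sketch, assuming a production namespace named prod:
kubectl get pods -n prod | grep binder          # find the current binder pod name
kubectl delete pod -n prod <binder-pod-name>    # the Deployment spins up a replacement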
12:19#
Launches aren’t working again, taking a very long time to start up.
Deleted binder and hub pods
This resolved the issue a second time.
This is the utilization behavior seen:
13:29#
Behavior is once again going wrong. Launches taking forever to load. We note a lot of networky-looking problems in the logs.
13:41#
Deleted several evicted pods. Pods are often evicted when available resources run low and the kubelet evicts them to free up resources for more important services.
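A hedged sketch of the kind of cleanup involved, assuming the production namespace is called prod (evicted pods show up with an Evicted status and are safe to delete, since they are no longer running):
kubectl get pods -n prod | grep Evicted                                                        # list evicted pods
kubectl get pods -n prod | grep Evicted | awk '{print $1}' | xargs kubectl delete pod -n prod  # clean them up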
14:01#
Confirm that networking seems to be fine between production pods.
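One way to spot-check pod-to-pod networking is to run a throwaway pod and hit the hub’s internal API; a sketch, assuming the hub service is reachable as hub:8081 inside the prod namespace (as in the zero-to-jupyterhub chart):
kubectl run net-test -n prod --rm -it --restart=Never --image=busybox -- wget -qO- http://hub:8081/hub/api    # should return a small JSON version payload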
14:16#
Note that the CPU utilization of the BinderHub pod is at 100%. If we restart the Binder pod, the new one gradually increases CPU utilization until it hits 100%, then problems begin.
This explains why the earlier short-term fixes of deleting the binder pod worked.
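The saturation is visible directly from pod metrics; a sketch, assuming the prod namespace and that cluster metrics (heapster or metrics-server) are available:
kubectl top pod -n prod | grep binder    # binder pod pegged at its CPU limit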
14:45#
We realize that the jupyterlab-demo
repository has been updated and has
a lot of traffic. This seems to be causing strange behavior because it is
still building.
15:11#
jupyterlab-demo
repository is banned, and behavior subsequently returns to
normal.
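For context, BinderHub supports banning repositories through the repo provider’s banned_specs option, a list of regexes matched against the repository spec; a minimal sketch of the kind of configuration change involved (the exact pattern below is illustrative):
# BinderHub traitlets config, e.g. in the deployment's binderhub_config.py
c.GitHubRepoProvider.banned_specs = [
    "^jupyterlab/jupyterlab-demo.*",  # reject build/launch requests for this repo
]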
Post-mortem suggests this is the problem:
- The jupyterlab-demo repo updated itself, triggering a build
- The update to jupyterlab-demo installed a newer version of JupyterLab
- repo2docker needed a loooong time to install this (perhaps because of webpack size issues)
- Since the repository gets a lot of traffic, each launch request made while the build is still happening eats up CPU in the Binder pod
- The Binder pod was thus getting saturated and behaving strangely, causing the outage
Banning the jupyterlab-demo repository resolved the CPU saturation issue.
Lessons learned#
What went well#
The Binder team did a great job of distributed debugging and got this fixed relatively quickly once the error was spotted!
What went wrong#
It took a while before we realized launch behavior was going wonky. We really could use a notifier for the team :-/
Action items#
Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten.
Process improvements#
Set up notifications of downtime (issue)
Technical improvements#
Find a way to gracefully handle repositories that take a long time to build (https://github.com/jupyterhub/binderhub/issues/624)
Find a way to avoid overloading the Binder CPU when a repository is building and also getting a lot of traffic at the same time. (https://github.com/jupyterhub/binderhub/issues/624)