2017-09-27, Hub 403

Summary

After a deployment, most users were getting a ‘403 Forbidden’ when attempting to launch their image. This was caused by a small bug that manifests only on multi-node Kubernetes clusters. The bug was identified and fixed, but logistical issues delayed the fix, leaving beta unusable for about 1h50m.

Timeline

All times in PST

Sep 27 2017 13:06

A lot of changes related to launching user servers through the hub API are deployed to staging. Testing caused a ‘403 Forbidden’ on the first try, but worked on the second try. This was attributed to the mysterious ‘oh must have been stale cookies’ reason, and not counted as a problem. Deployment proceeds.

13:35

Deployment to beta is complete! ‘403 Forbidden’ recurs once, but works the second time - again attributed to ‘stale cookies’. Some cluster maintenance work is then begun:

Make sure cluster nodes have SSDs
Make sure cluster nodes have large SSDs (500G, not 100G) for better performance, since on Google Cloud disk performance scales with disk size.

These take a while. It also looked like two of the three nodes present were having disk trouble, so it seemed a good time to roll ‘em over (without doing too much investigation).

13:44

‘403 Forbidden’ is reported by others, and it does not clear on the second try. All nodes are now new, so this is unrelated to nodes having issues. Within a few minutes, it is clear that the ‘403 Forbidden’ does not go away after first use, and we should consider beta as ‘broken’ now.

14:08

Digging into logs and attempts to reproduce this yield the following conclusions:

It works when the server starts up within 10s (thus not triggering the ‘server is slow to spawn’ message in hub logs)
It leads to 403 when the server takes longer than 10s.

A quick look at the code made the problem clear:

If the spawn takes more than 10s, the hub returns a 202 status code, and we then have to poll to see when our server has started.
In binderhub, we were not doing this polling. We instead prematurely redirect to the hub, which freaks out at the non-running server and gives us the 403. This is because we use nullauthenticator for the hub, whose sole job is to 403 every user not already logged in somehow!
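The missing polling step can be sketched roughly as follows. This is a minimal illustration, not the code from the actual fix: `wait_for_server`, `request_start`, and `is_ready` are hypothetical names, the two callables stand in for the real HTTP calls to the hub, and treating a 201 response as "server already running" is an assumption about the hub API.

```python
import time
from typing import Callable


def wait_for_server(
    request_start: Callable[[], int],
    is_ready: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Return only once the user's server is running.

    request_start() asks the hub to spawn the server and returns the
    HTTP status code of that request (hypothetical helper).
    is_ready() asks the hub whether the server is up (hypothetical helper).
    """
    status = request_start()
    if status == 201:
        # Assumption: 201 means the spawn finished within the hub's
        # slow-spawn window, so redirecting right away is safe.
        return
    if status != 202:
        raise RuntimeError(f"unexpected status {status} from spawn request")

    # 202: the spawn was accepted but is still pending. Poll until the
    # server is ready instead of redirecting to a non-running server.
    deadline = clock() + timeout
    while clock() < deadline:
        if is_ready():
            return
        sleep(interval)
    raise TimeoutError(f"server did not start within {timeout} seconds")
```

Only after `wait_for_server` returns would the user be redirected to the hub. The `sleep` and `clock` parameters exist so the loop can be exercised in tests without real waiting.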

This diagnosis was a little confused by the following facts:

We also upgraded from JupyterHub 0.7 to 0.8 with this deployment, and decided to just wipe out the user db rather than upgrade it. This caused a lot of messages of the following form:

[W 2017-09-28 16:20:51.785 JupyterHub base:350] Failed login for unknown user
[W 2017-09-28 16:20:51.787 JupyterHub log:122] 403 GET /hub/login?next=%2Fhub%2Fuser%2F13b14ca5-5c6b-4340-8157-651ad95a2739%2Fapi%2Fcontents%2Fnotebooks%2Fxwidgets.ipynb%3Fcontent%3D0%26_%3D1506507345627 (@10.100.16.12) 2.00ms

These were red herrings, just failures for people who were unceremoniously booted off during the deploy.

In the cluster, most images had already been locally cached. So some images would launch in under 10s, hence not causing the 403. This made tracking down the cause a little harder than if it had failed consistently.
This would not be reproducible on minikube without artificial constraints, since it does not use a registry & the build host is always the same as the run host.

Once it was clear what the issue was, the fix seemed obvious enough.

15:11

A PR is sent up with the fix. It isn’t as configurable as it should be, but should work. This is delayed by the fact that the deployer’s local minikube had run out of space & needed to be reset.

15:25

The PR has been merged and deployed, bringing the outage to an end.

Conclusion

This was caused by two factors:

Local development environment being slightly different from production environment (single node vs multi node)
Testing in staging not thorough enough

Debugging this was complicated by the fact that we changed a lot of things at the same time (JupyterHub version, binder code, etc). This was a big breaking change that required us to kick users out. We should try not to do those.

Action Items

BinderHub

Handle errors on launching servers on the hub more gracefully. Issue

Deployment

Write end-to-end tests that help validate that a deployment was successful, and make sure they are part of marking a deployment green or red. Issue
Write down the responsibilities of the people doing the deployment. Specifically include the lesson of ‘never ignore any errors in staging, they will always bite you in production’ prominently. Issue
Do deployments more often so we never end up doing one big, massive deployment. Those will always cause problems. Better tooling that makes deployments easier will help with this. Issue
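As a rough illustration of the end-to-end check in the first item, the sketch below launches an image several times and treats any failed launch as a red deployment, rather than retrying once and moving on. The names `failed_launches`, `deployment_is_green`, and `launch` are hypothetical, and `launch` stands in for whatever actually drives a launch and reports the final HTTP status.

```python
from typing import Callable, List, Tuple


def failed_launches(
    launch: Callable[[], int], attempts: int = 5
) -> List[Tuple[int, int]]:
    """Launch an image `attempts` times and collect every failure.

    launch() performs one full launch and returns the HTTP status the
    user ends up with (hypothetical helper). Each failure is recorded
    as (attempt number, status) so that an intermittent 403 on the
    first try is still visible.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        status = launch()
        if status != 200:
            failures.append((attempt, status))
    return failures


def deployment_is_green(launch: Callable[[], int], attempts: int = 5) -> bool:
    """A deployment is green only if every single launch succeeded."""
    return not failed_launches(launch, attempts)
```

Under this rule, the staging run in this incident (a 403 on the first try, success on the second) would have been marked red.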

Local development

Consider figuring out an easy way for people to run a development environment that is closer to production (multi-node, image registry, etc) than minikube. Issue