2017-09-29, 504#

Summary#

mybinder.org was failing to launch any images. Building worked fine, but launching would fail with a 504 Timeout on /run?, served by nginx. Additionally, though not visible to users, some idle users could not be deleted.

Timeline#

All times in CEST

Sep 29 2017 08:59#

https://github.com/jupyterhub/binderhub/issues/140 is opened, reporting failure to build due to “Stream disconnection”

14:00#

While investigating the above issue, test builds of the affected repo were performed. The builds succeeded (failing to reproduce the issue), but launching the image failed with an nginx 504 Timeout. No image could be launched.

After retrieving the logs for the Hub pod, several errors were discovered:

  • The Spawner.start() method is not returning, as indicated by the logs:

      binder-testing-xyz's server failed to start in 300 seconds, giving up
    

This is the first stage in launching, and it is purely under KubeSpawner’s control; JupyterHub never gets as far as checking whether the server is running.

  • Inspecting the pod that failed to start reveals that it is indeed running and responsive.
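
For reference, this manual check can be reproduced with the Kubernetes Python client. A minimal sketch, assuming the namespace and label selector from the reflector logs further down; the pod name is illustrative:

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when run inside the cluster
    v1 = client.CoreV1Api()

    # The pod the Hub gave up on is, in fact, Running (pod name is illustrative):
    pod = v1.read_namespaced_pod("jupyter-binder-testing-xyz", namespace="beta")
    print(pod.status.phase)

    # The same listing KubeSpawner's PodReflector is supposed to keep watching:
    pods = v1.list_namespaced_pod(
        "beta",
        label_selector="heritage=jupyterhub,component=singleuser-server",
    )
    print([p.metadata.name for p in pods.items])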

At the same time, other errors are in the logs:

  • Some Spawners had entered a permanent stop_pending state, as indicated by the cull-idle service failing to delete users because they were ‘pending stop’ (see the sketch after this list). Logs:

      400 DELETE /hub/api/users/binder-testing-chcr2h3j (1.2.3.4): binder-testing-chcr2h3j's server is in the process of stopping, please wait.
    
  • Inspecting the logs reveals that the pod has been deleted, but Spawner.stop has not returned.

  • I believe this has been reported before here: https://github.com/jupyterhub/jupyterhub/issues/1420
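
To make the cull failure concrete, this is roughly the request the cull-idle service makes, sketched here with requests against the Hub REST API; the Hub address and token environment variable are assumptions, and this is not the actual cull_idle_servers script:

    import os
    import requests

    # Assumed values: the Hub API address and token env var are illustrative.
    hub_api = "http://127.0.0.1:8081/hub/api"
    headers = {"Authorization": "token " + os.environ["JUPYTERHUB_API_TOKEN"]}

    name = "binder-testing-chcr2h3j"
    resp = requests.delete(hub_api + "/users/" + name, headers=headers)
    if resp.status_code == 400:
        # "...server is in the process of stopping, please wait."
        # With the Spawner stuck in stop_pending, this never succeeds.
        print("cannot delete yet:", resp.text)
    else:
        resp.raise_for_status()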

I believe all of these reflect a bug where the KubeSpawner reflector misses events and never recovers (see https://github.com/jupyterhub/kubespawner/pull/81 for a possible fix).
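
To make the failure mode concrete: a pod reflector is essentially a watch loop that mirrors the cluster’s pod list into a local cache that KubeSpawner consults. The following is a minimal sketch of that pattern using the Kubernetes Python client, not KubeSpawner’s actual code. If the watch stream dies and nothing restarts it, the cache silently goes stale, which would explain both symptoms above:

    from kubernetes import client, config, watch

    config.load_incluster_config()
    v1 = client.CoreV1Api()

    pods = {}  # local cache: pod name -> pod object

    def watch_pods(namespace="beta",
                   selector="heritage=jupyterhub,component=singleuser-server"):
        w = watch.Watch()
        for event in w.stream(v1.list_namespaced_pod, namespace,
                              label_selector=selector):
            pod = event["object"]
            if event["type"] == "DELETED":
                pods.pop(pod.metadata.name, None)
            else:
                pods[pod.metadata.name] = pod
        # Reaching this point (or raising an unhandled exception) means the watch
        # has stopped. Unless something restarts it, `pods` never changes again:
        # starts appear to time out and stops appear to never finish.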

Additionally, the logs contained several reflector connection errors, likely connected to the missed events:

    2017-09-29 00:46:54,810 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd66401bd68>: Failed to establish a new connection: [Errno 111] Connection refused',)': /api/v1/namespaces/beta/pods?labelSelector=heritage%3Djupyterhub%2Ccomponent%3Dsingleuser-server
    [E 2017-09-29 00:46:54.814 JupyterHub reflector:113] Error when watching pods, retrying in 25.6s
    <snip>
        urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='x.x.x.x', port=443): Max retries exceeded with url: /api/v1/namespaces/beta/pods?labelSelector=heritage%3Djupyterhub%2Ccomponent%3Dsingleuser-server (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd666225748>: Failed to establish a new connection: [Errno 111] Connection refused',))
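
The 25.6 s delay in that log is consistent with a capped exponential backoff (0.1 s doubled eight times). A hedged sketch of that retry pattern around the watch loop sketched above; illustrative only, not the actual reflector code:

    import time

    def run_forever(watch_fn, start_delay=0.1, max_delay=30):
        delay = start_delay
        while True:
            try:
                watch_fn()           # e.g. watch_pods() from the earlier sketch
                delay = start_delay  # the watch ended cleanly; reset the backoff
            except Exception as e:
                print("Error when watching pods, retrying in %.1fs: %s" % (delay, e))
                time.sleep(delay)
                delay = min(delay * 2, max_delay)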

14:10#

The Hub pod is relaunched to clear the invalid state. Service is back to normal, though no actual fixes have been applied; the same issue is expected to return.

Action Items#

BinderHub#

  1. Removing the redirect page will get rid of the 504 timeout issue by redirecting directly from the event stream on the build page (Pull Request).

  2. Diagnose the original “Stream disconnection” build failure, which has still not been identified.

KubeSpawner#

  1. Fix the known issue of the PodReflector missing events (Pull Request).

  2. Investigate further possibilities of reflector failure and recovery (Issue).

  3. Unrecoverable PodReflector errors should abort the Hub (Pull Request).

The primary cause of these issues is the loss of events on the KubeSpawner PodReflector. https://github.com/jupyterhub/kubespawner/pull/81 fixes at least one known case.

Deployment#

  1. Implement health monitoring and alerting to respond more quickly to these problems (Issue).

  2. Automatic recovery could potentially relaunch the Hub pod when it enters an unhealthy state (a rough sketch of both ideas follows).
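
A hedged sketch of what such monitoring and recovery could look like: poll the Hub’s REST API and, after repeated failures, delete the Hub pod so its Deployment recreates it. The endpoint, thresholds, namespace, and labels here are assumptions, not an existing mybinder.org component:

    import time

    import requests
    from kubernetes import client, config

    HUB_API = "https://mybinder.org/hub/api"  # responds with the Hub version when healthy
    FAILURES_BEFORE_RESTART = 3

    def hub_healthy():
        try:
            return requests.get(HUB_API, timeout=10).ok
        except requests.RequestException:
            return False

    def restart_hub(namespace="beta"):
        # Deleting the pod lets its Deployment schedule a fresh one.
        config.load_incluster_config()
        v1 = client.CoreV1Api()
        pods = v1.list_namespaced_pod(namespace, label_selector="component=hub")
        for pod in pods.items:
            v1.delete_namespaced_pod(pod.metadata.name, namespace,
                                     body=client.V1DeleteOptions())

    failures = 0
    while True:
        failures = 0 if hub_healthy() else failures + 1
        if failures >= FAILURES_BEFORE_RESTART:
            restart_hub()
            failures = 0
        time.sleep(60)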