2018-02-12, Hub Launch Fail
Binder was successfully building user pods, but was then failing to direct
users to the built pods. It was fixed by deleting the hub pods.
All times in PST
We noticed high usage on the mybinder deployment. We tried building a repository: it reached the “launching” step and never proceeded further. Eventually it returned a “your image took too long to launch” error.
On the Grafana board, the “Launch Times Summary” plot showed all pods as failing to launch.
We delete the hub pods.
Two people confirm that their pods now build and launch fine. Grafana also shows successful “Launch Times Summary” data.
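The hub restart described in this timeline can be sketched as a small helper. This is a minimal sketch, not the exact command we ran: the label selector `component=hub` is an assumption about how the hub pod is labelled, the namespace must be supplied by the operator, and the function names are hypothetical. By default it only prints the command so it can be reviewed first.

```python
import shlex
import subprocess


def hub_delete_command(namespace, selector="component=hub"):
    """Build the kubectl command that deletes the hub pod(s).

    The selector default is an assumption; check the labels on the
    hub pod in your own deployment before relying on it.
    """
    return [
        "kubectl", "delete", "pod",
        "--namespace", namespace,
        "--selector", selector,
    ]


def restart_hub(namespace, dry_run=True):
    """Print the delete command, and run it only if dry_run is False.

    Deleting the pod lets its controller recreate it, which is the
    "restart" that resolved this incident.
    """
    cmd = hub_delete_command(namespace)
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

For example, `restart_hub("example-namespace")` (a placeholder namespace) prints the command without touching the cluster. Pass `dry_run=False` once the namespace and selector have been checked.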
What went well
Once we noted the problem, it was quickly resolved.
What went wrong
The outage lasted nearly an hour before we noticed it. This is partly because the site itself returned no errors; launches simply hung until they timed out.
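One way to catch this kind of silent failure sooner is to check the launch failure ratio rather than site errors. The following is only an illustrative sketch: the `Launch` record, the thresholds, and the function name are all hypothetical, and it does not reflect any alerting that existed at the time.

```python
from dataclasses import dataclass


@dataclass
class Launch:
    """One launch attempt (hypothetical record, not a real Binder type)."""
    succeeded: bool
    seconds: float


def launches_look_broken(launches, max_failure_ratio=0.5, min_samples=5):
    """Return True when too many recent launches failed.

    With fewer than min_samples attempts we return False, so that a
    single failed launch during a quiet period does not raise an alarm.
    """
    if len(launches) < min_samples:
        return False
    failures = sum(1 for launch in launches if not launch.succeeded)
    return failures / len(launches) > max_failure_ratio
```

During this incident every launch was failing, so a check like this would have fired as soon as `min_samples` attempts had been recorded.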
Where we got lucky
The solution was just “delete hub”, and the problem resolved itself.
Action items
Improve the team operations around debugging the cluster more generally. We should make sure that, on average, there are N>1 people around with the skills and time to debug the deployment.
Improve the language around site reliability expectations for mybinder.org, so that these kinds of outages don’t feel like we’re letting users down. (link to issue)