This document tries to explain what is going on when a deployment
to mybinder.org happens. For how to do a deploy, please see how.
The deployment happens in various stages, each of which comprises
a series of steps. Each step of the deployment is
controlled by .travis.yml, which should be considered the authoritative
source of truth for deployment. If this document disagrees with it, .travis.yml is correct!
If any of the steps in any stage fails, all following steps
are canceled and the deployment is marked as failed.
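The fail-fast behaviour described above can be pictured with a small sketch. This is only an illustration and not how Travis implements it: the stage and step names, and the run_stages helper itself, are hypothetical.

```python
from typing import Callable, List, Tuple

# A step is a (name, callable) pair; the callable returns True on success.
Step = Tuple[str, Callable[[], bool]]
Stage = Tuple[str, List[Step]]


def run_stages(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run every step of every stage in order and stop at the first failure.

    Returns (succeeded, names of the steps that were actually started).
    """
    started: List[str] = []
    for stage_name, steps in stages:
        for step_name, step in steps:
            started.append(f"{stage_name}: {step_name}")
            if not step():
                # All following steps, in this and later stages, are skipped
                # and the whole deployment counts as failed.
                return False, started
    return True, started
```

A failing step in the first stage means no step of any later stage is ever started.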
Deployment requires the following tools to be installed. Note:
since deployments are handled with Travis CI, you don’t
need them on your local computer.
mybinder.org currently runs on Google Cloud
in a Google Kubernetes Engine
cluster. We need gcloud to authenticate ourselves to this cluster.
helm is the package manager for Kubernetes. We use it to install
and upgrade the various components running mybinder.org (BinderHub, JupyterHub,
extra mybinder.org-specific services, etc.).
kubectl is the canonical command line client for interacting with the Kubernetes
API. We primarily use it to explicitly wait for our deployment to complete
before running our tests.
We use pytest to verify that our deployment successfully completed, by running
a series of end-to-end tests (present in the tests/ directory) against the
new deployment. This makes sure that both builds and launches are working,
and is an important part of giving us confidence to do continuous deployment.
We keep a number of secrets (in secrets/) in our deployment, such as
cloud credentials. We use git-crypt to keep them in this repository
in an encrypted form. We use Travis’s encrypted file support
for our repository to store the git-crypt decryption key.
All of the tools above are installed. We use the before_deploy section
in .travis.yml to install these, mostly so we get nice log folding. The only exception
is the pytest installation - that is in the install section, so we can leverage
travis caching to speed up our deploys.
All Stage 1 failures can be attributed to one of the following causes:
Network connections from Travis are flaky, leading to failed installations
This is the most likely cause of Stage 1 failures. When this happens, we have no choice
but to restart the Travis build.
If a restart also fails, there are two possible reasons:
Travis is having some infrastructure issues. Check the Travis Status Page
to see if this is the case.
The method we are using to install any of these bits of software is
having issues - either it no longer works due to some changes to the software, or
the software installer is depending on things that are having temporary difficulties.
Look at which software installation is failing, and debug that!
The commit we are trying to deploy modified .travis.yml, and introduced a bug / typo.
The person who wrote the PR modifying .travis.yml should debug what
the error is and fix it in a subsequent PR.
The following secrets are present in encrypted form in the repository:
Secret config for the helm charts (under secrets/config). These contain various
deployment secrets for staging and prod, such as proxy tokens, registry authentication,
Google Cloud Service Accounts
for both the staging and production Google cloud projects
(as secrets/gke-auth-key-staging.json and secrets/gke-auth-key-prod.json).
These have a custom Role
called travis-deployer that gives them just the permissions needed to do
A GitHub deploy key
for the binderhub-ci-repos/requirements
repo (as secrets/binderhub-ci-key). This is used in our tests to force the
deployed binderhub to do a build + launch, rather than just a launch (via
The git-crypt symmetric key needed to decrypt these secrets is stored as travis/crypt-key.enc,
encrypted with Travis’s encrypted file
support. Travis only supports one encrypted file per repo, and such files are one-way encrypted
(you cannot easily get the plain text back!), which forces us to use git-crypt for everything else.
Decrypt the git-crypt key with the travis-provided openssl command
Decrypt all other secrets with the git-crypt key
At the end of this step, all the secrets required for a successful deployment
are available in unencrypted form.
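As a rough sketch, the two decryption steps above correspond to two command lines. .travis.yml remains the authoritative version; the environment variable names and the output path travis/crypt-key below are assumptions made for illustration (Travis generates its own variable names when a file is encrypted).

```python
from typing import List, Mapping


def decrypt_commands(env: Mapping[str, str], key_var: str, iv_var: str) -> List[List[str]]:
    """Build the argument lists for the two decryption steps.

    key_var and iv_var name the environment variables that hold the key and
    IV for the Travis-encrypted file. Their names here are hypothetical.
    """
    encrypted_key = "travis/crypt-key.enc"
    decrypted_key = "travis/crypt-key"  # assumed output path
    return [
        # Step 1: decrypt the git-crypt key with the Travis-provided openssl command.
        ["openssl", "aes-256-cbc",
         "-K", env[key_var], "-iv", env[iv_var],
         "-in", encrypted_key, "-out", decrypted_key, "-d"],
        # Step 2: decrypt all other secrets with the git-crypt key.
        ["git-crypt", "unlock", decrypted_key],
    ]
```

Each list can be handed to subprocess.run(cmd, check=True), so a failing openssl call stops the deployment before git-crypt runs.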
Someone has used the travis encrypt-file command for this repository, overwriting
the current travis encryption key (which is used to decrypt the git-crypt encryption
key), and committed this change. This causes issues because travis encrypt-file
can only encrypt one file per repo, so if you encrypt another file the first file
This will manifest as an error from the openssl command.
The simplest fix is to revert the PR that encrypted another file. git-crypt
should be used instead for encrypting additional files.
We use helm charts
to configure mybinder.org. We use charts both from the
official kubernetes charts repository,
as well as the JupyterHub charts repository.
To set up helm to do the deployment, we do the following:
Set up the helm client, allowing it to create the local config files it needs
Set up the JupyterHub charts repository for use with this helm installation,
and fetch the latest chart definitions.
Fetch all the dependencies of the mybinder deployment chart with the versions
specified in mybinder/requirements.yaml, and store them locally to ready them
At the end of this step, helm has been fully configured to do deployment of our
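The helm setup steps above might look roughly like the following command sequence. This sketch assumes a Helm 2 client (the use of requirements.yaml suggests Helm 2), a repository alias of jupyterhub, and a chart directory named mybinder; the exact commands in .travis.yml may differ.

```python
from typing import List

# Assumption: the JupyterHub chart repository is served from this URL.
JUPYTERHUB_CHART_REPO = "https://jupyterhub.github.io/helm-chart"


def helm_setup_commands(chart_dir: str = "mybinder") -> List[List[str]]:
    """Return the helm commands for the setup steps, in order."""
    return [
        # Set up the helm client only, creating its local config files (Helm 2).
        ["helm", "init", "--client-only"],
        # Register the JupyterHub charts repository under an assumed alias.
        ["helm", "repo", "add", "jupyterhub", JUPYTERHUB_CHART_REPO],
        # Fetch the latest chart definitions.
        ["helm", "repo", "update"],
        # Download the dependencies pinned in <chart_dir>/requirements.yaml.
        ["helm", "dep", "up", chart_dir],
    ]
```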
Invalid version for a dependency in mybinder/requirements.yaml
This manifests as an error from helm dep up that looks like the following:
helm dep up
Error: Can't get a valid version for repositories <dependency>. Try changing the version constraint in requirements.yaml
<dependency> in the above error message should point to the erroring dependency
whose version needs to be fixed.
If this happens for the binderhub dependency, the most common reason is that
you have not waited long enough after merging a PR in the binderhub repo before
bumping the version here. Make sure the version of binderhub is visible in
https://jupyterhub.github.io/helm-chart before merging a PR here.
We create an annotation
in Grafana, recording the fact that a deployment is starting.
This is very useful when looking at dashboards, since you can see
the effects of deployments in various metrics.
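For illustration, an annotation like this can be created through Grafana's HTTP API with a POST to /api/annotations. The sketch below only builds the request and does not send it; the Grafana URL, the API key, and the tag names are assumptions and are not taken from our deployment scripts.

```python
import json
import time
import urllib.request
from typing import Iterable


def build_annotation_request(grafana_url: str, api_key: str,
                             text: str, tags: Iterable[str]) -> urllib.request.Request:
    """Build (but do not send) a request that creates a Grafana annotation."""
    payload = {
        "time": int(time.time() * 1000),  # Grafana expects epoch milliseconds
        "tags": list(tags),
        "text": text,
    }
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would actually create the annotation.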
We have a staging environment that is configured
exactly like production, but smaller (to control costs). We use this to test all
deployments before they hit the production mybinder.org website.
We use the deploy.py script to do the helm deployment. This script does the
Use the Google Cloud Service Accounts we decrypted in Stage 2, Step 1 to get
a valid ~/.kube/config
file. This file is used by both helm and kubectl to access the cluster.
Use helm upgrade
to actually do the deployment. This deploys whatever changes the commit has -
new chart versions, changes to configuration, new repo2docker version, etc.
We have a ten minute timeout here.
We use kubectl rollout to wait for all Deployment and DaemonSet
objects to be fully ready. In theory, the --wait flag of helm upgrade does
this, but it is not complete enough for our use case.
Once we have verified that all the Deployment and DaemonSet objects are ready,
the helm deployment is complete!
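The upgrade and the rollout wait can be sketched as command lines. This is not the actual content of deploy.py: the release name, chart path, and component names are hypothetical, and the integer --timeout value assumes a Helm 2 client, where the timeout is given in seconds.

```python
from typing import List, Sequence


def helm_upgrade_command(release: str, namespace: str, chart: str,
                         timeout_seconds: int = 600) -> List[str]:
    """Build a helm upgrade command with a ten minute timeout (Helm 2 syntax)."""
    return ["helm", "upgrade", "--install",
            "--namespace", namespace, release, chart,
            "--wait", "--timeout", str(timeout_seconds)]


def rollout_status_commands(namespace: str, deployments: Sequence[str],
                            daemonsets: Sequence[str]) -> List[List[str]]:
    """Build one kubectl rollout status command per Deployment and DaemonSet."""
    commands = []
    for kind, names in (("deployment", deployments), ("daemonset", daemonsets)):
        for name in names:
            # kubectl rollout status blocks until the object is fully rolled out.
            commands.append(["kubectl", "rollout", "status",
                             "--namespace", namespace, "--watch", f"{kind}/{name}"])
    return commands
```

Running the rollout commands one after another makes the script wait for every object explicitly, instead of relying on --wait alone.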
YAML formatting issue in one of the config files
YAML syntax can be finicky sometimes, and fail in non-obvious ways. The most common
error is the presence of tab characters in a YAML file, which always causes parsing to fail.
The Helm community has some tips
on common YAML issues. Learn X in Y Minutes also has a nice guide on YAML.
You can also use yamllint locally to validate
your YAML files.
Never copy and paste secret files into online YAML linting applications!
Doing so could compromise mybinder.org.
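Tab characters in indentation are easy to check for locally, without sending a file anywhere. The helper below is a minimal sketch (find_tab_indented_lines is a hypothetical name) and does not replace a real linter such as yamllint.

```python
from typing import List


def find_tab_indented_lines(yaml_text: str) -> List[int]:
    """Return the 1-based numbers of lines whose indentation contains a tab."""
    bad_lines = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        # The indentation is everything before the first non-blank character.
        indentation = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indentation:
            bad_lines.append(lineno)
    return bad_lines
```

For a file read with open(path).read(), an empty result means no line is indented with tabs.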
Kubernetes cluster is having difficulties
This is usually manifested by either helm or kubectl reporting connection errors.
Bugs in helm itself
Fairly rare, but bugs in helm itself might cause failure.
Severe bugs in the version of binderhub, jupyterhub or any of the dependencies deployed.
This will usually manifest as a kubectl rollout command hanging forever. This
happens when the component that kubectl rollout is waiting for has a bug that makes
it crash repeatedly, so it never becomes ready.
Checking which component is affected, and reading its logs, should help!
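One way to keep a hanging command from blocking forever is to bound the wait. The sketch below uses the timeout parameter of subprocess.run; it is an illustration and not something the deployment scripts are known to do, and the helper name is hypothetical.

```python
import subprocess
import sys
from typing import Sequence


def run_with_timeout(cmd: Sequence[str], timeout_seconds: float) -> bool:
    """Run cmd; return True only if it exits with status 0 within the timeout."""
    try:
        completed = subprocess.run(list(cmd), timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    # Stand-in for a kubectl rollout command: a child process that exits at once.
    print(run_with_timeout([sys.executable, "-c", "pass"], timeout_seconds=30))
```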
We run the tests in tests/ with pytest to validate that the deployment succeeded.
These try to be as thorough as possible, simulating the tests a human would do to
ensure that the site works as required.
Look at the docstrings in the files under tests/
to see which tests are run.
If all the tests succeed, we can consider the staging deployment a success!
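To give a flavour of what such a test can check, here is a hedged sketch and not a copy of anything in tests/. It assumes that BinderHub reports build progress as event-stream lines of the form data: {...} carrying a phase field, and that a build ends in either the ready or the failed phase. The sample events are made up.

```python
import json
from typing import Iterable, Optional


def final_phase(event_lines: Iterable[str]) -> Optional[str]:
    """Return the last phase seen, stopping at the first 'ready' or 'failed'."""
    phase = None
    for line in event_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and keep-alive comments
        event = json.loads(line[len("data:"):])
        phase = event.get("phase", phase)
        if phase in ("ready", "failed"):
            break
    return phase


def test_build_stream_ends_ready():
    """A successful build and launch should end in the 'ready' phase."""
    lines = [
        'data: {"phase": "building", "message": "Step 1/5"}',
        "",
        'data: {"phase": "launching", "message": "Launching server"}',
        'data: {"phase": "ready", "message": "server running"}',
    ]
    assert final_phase(lines) == "ready"
```

The docstring states what the test checks, which is what makes a failing test easy to interpret from the pytest output.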
Bugs in the version of binderhub or jupyterhub deployed, causing any of the tests
in tests/ to fail.
The output should tell you which test failed. You can look at the docstring for the
failing test to understand what it was testing, and debug from there.
After deploying to staging and validating it with tests, we have a reasonable amount of confidence
that it is safe to deploy to production. The production deploy has the exact same steps as
the staging deploy, but targets production (branch and namespace prod) instead of staging.