What does a MyBinder.org deployment do?

This document tries to explain what is going on when a deployment to mybinder.org happens. For how to do a deploy, please see how.

The deployment happens in various stages, each of which comprise of a series of steps. Each step of the deployment is controlled by .travis.yml, which should be considered the authoritative source of truth for deployment. If this document disagrees with it,.travis.yml is correct!

If any of the steps in any stage fails, all following steps are canceled and the deployment is marked as failed.

Stage 1: Installing deployment tools

Step 1: Install all the things!


Deployment requires the following tools to be installed. Note: since deployments are handled with Travis CI, you don’t need them on your local computer.

  1. gcloud

    mybinder.org currently runs on Google Cloud in a Google Kubernetes Engine cluster. We need gcloud to authenticate ourselves to this cluster.

  2. helm

    helm is the package manager for Kubernetes. We use this for actually installing and upgrading the various components running mybinder.org (BinderHub, JupyterHub, extra mybinder.org-specific services, etc)

  3. kubectl

    kubectl is the canonical command line client for interacting with the Kubernetes API. We primarily use it to explicitly wait for our deployment to complete before running our tests.

  4. pytest

    We use pytest to verify that our deployment successfully completed, by running a series of end-to-end tests (present in the tests/ directory) against the new deployment. This makes sure that both builds and launches are working, and is an important part of giving us confidence to do continuous deployment.

  5. git-crypt

    We have a bunch of secrets (in secrets/) in our deployment - things like cloud credentials, etc. We use git-crypt to keep them in this repository in an encrypted form. We use the encrypted travis file for our repository to store the git-crypt decryption key.

What happens

  • All of the tools above are installed. We use the before_deploy section in .travis.yml to install these, mostly so we get nice log folding. The only exception is the pytest installation - that is in the install section, so we can leverage travis caching to speed up our deploys.

What could go wrong?

All Stage 1 failures can be attributed to one of the following causes:

  1. Network connections from Travis are being flaky, leading to failed installations

    This is the most likely cause of Stage 1 failures. When this happens, we have no choice but to restart the Travis Build.

    If a restart also fails, there are two possible reasons:

    1. Travis is having some infrastructure issues. Check the Travis Status Page to see if this is the case.

    2. The method we are using to install any of these bits of software is having issues - either it no longer works due to some changes to the software, or the software installer is depending on things that are having temporary difficulties. Look at which software installation is failing, and debug that!

  2. The commit we are trying to deploy modified .travis.yml, and introduced a bug / typo.

    The person who wrote the PR modifying .travis.yml should debug what the error is and fix it in a subsequent PR.

Stage 2: Configuring deployment tools

Step 1: Decrypting secrets


The following secrets are present in encrypted form in the repository:

  1. Secret config for the helm charts (under secrets/config). These contain various deployment secrets for staging and prod, such as proxy tokens, registry authentication, etc.

  2. Google Cloud Service Accounts for both the staging and production Google cloud projects (as secrets/gke-auth-key-staging.json and secrets/gke-auth-key-prod.json). These have a custom Role called travis-deployer that gives them just the permissions needed to do deployments.

  3. A GitHub deploy key for the binderhub-ci-repos/requirements repo (as secrets/binderhub-ci-key). This is used in our tests to force the deployed binderhub to do a build + launch, rather than just a launch (via tests/test_build.py)

The git-crypt symmetric key needed to decrypt these secrets is travis/crypt-key.enc, encrypted with Travis’s encrypted file support. Travis only supports one encrypted file per repo, and these are one-way encrypted only (you can not get plain text back easily!), forcing us to use git-crypt.

What happens?

  1. Decrypt the git-crypt key with the travis-provided openssl command

  2. Decrypt all other secrets with the git-crypt key

At the end of this step, all the secrets required for a successful deployment are available in unencrypted form.

What could go wrong?

  1. Someone has used the travis encrypt-file command for this repository, overwriting the current travis encryption key (which is used to decrypt the git-crypt encryption key), and committed this change. This causes issues because travis encrypt-file can only encrypt one file per repo, so if you encrypt another file the first file becomes undecryptable.

    This will manifest as an error from the openssl command.

    The simplest fix is to revert the PR that encrypted another file. git-crypt should be used instead for encrypting additional files.

Step 2: Setting up Helm


We use helm charts to configure mybinder.org. We use charts both from the official kubernetes charts repository, as well as the JupyterHub charts repository.

What happens

To set up helm to do the deployment, we do the following:

  1. Set up the helm client, allowing it to create the local config files it needs to function.

  2. Set up the JupyterHub charts repository for use with this helm installation, and fetch the latest chart definitions.

  3. Fetch all the dependencies of the mybinder deployment chart with the versions specified in mybinder/requirements.yaml, and store them locally to ready them for deployment.

At the end of this step, helm has been fully configured to do deployment of our charts.

What could go wrong?

  1. Invalid version for a dependency in mybinder/requirements.yaml

    This manifests as an error from helm dep up that looks like the following:

    Error: Can't get a valid version for repositories <dependency>. Try changing the version constraint in requirements.yaml

    <dependency> in the above error message should point to the erroring dependency whose version needs to be fixed.

    If this happens for the binderhub dependency, the most common reason is that you have not waited long enough after merging a PR in the binderhub repo before bumping the version here. Make sure the version of binderhub is visible in https://jupyterhub.github.io/helm-chart before merging a PR here.

Step 3: Tell Grafana our deployment is starting

We create an annotation in Grafana, recording the fact that a deployment is starting.

This is very useful when looking at dashboards, since you can see the effects of deployments in various metrics.

Stage 3: Deploy to staging

We have a staging environment that is configured exactly like production, but smaller (to control costs). We use this to test all deployments before they hit the production mybinder.org website.

Step 1: Set up and do the helm upgrade

What happens

We use the deploy.py script to do the helm deployment. This script does the following:

  1. Use the Google Cloud Service Accounts we decrypted in Stage 2, Step 1 to get a valid ~/.kube/config file. This file is used by both helm and kubectl to access the cluster.

  2. Use helm upgrade to actually do the deployment. This deploys whatever changes the commit has - new chart versions, changes to configuration, new repo2docker version, etc. We have a ten minute timeout here.

  3. We use kubectl rollout to wait for all Deployment and DaemonSet objects to be fully ready. Theoretically the --wait param to helm upgrade does this - but it is not complete enough for our use case.

Once we have verified that all the Deployment and DaemonSet objects are ready, the helm deployment is complete!

What could go wrong?

  1. YAML formatting issue in one of the config files

    YAML syntax can be finnicky sometimes, and fail in non-obvious ways. The most common error is the presence of tab characters in YAML, which will make them always fail.

    The Helm community has some tips on common YAML issues. Learn X in Y Minutes also has a nice guide on YAML. You can also use yamllint locally to validate your YAML files.

    Remember to not copy paste any secret files into online YAML Linting applications for linting! That could possibly compromise mybinder.org.

  2. Kubernetes cluster is having difficulties

    This is usually manifested by either helm or kubectl reporting connection errors.

  3. Bugs in helm itself

    Fairly rare, but bugs in helm itself might cause failure.

  4. Severe bugs in the version of binderhub, jupyterhub or any of the dependencies deployed.

    This will usually manifest as a kubectl rollout command hanging forever. This is caused by a bug in the component that kubectl rollout is waiting for constantly crashing, unable to stay up.

    Looking at what component it is, and perhaps in the logs, would help!

Step 2: Validate the deployment

What happens

We run the tests in tests/ with pytest to validate that the deployment succeeded. These try to be as thorough as possible, simulating the tests a human would do to ensure that the site works as required.

Look at the docstrings in the files under tests/ to see what are the tests being run.

If all the tests succeed, we can consider the staging deployment success!

What could go wrong?

  1. Bugs in the version of binderhub or jupyterhub deployed, causing any of the tests in tests/ to fail.

    The output should tell you which test fails. You can look at the docstring for the failing test to understand what it is was testing, and debug from there.

Stage 4: Deploy to production

What happens

After deploying to staging and validating it with tests, we have a reasonable amount of confidence that it is safe to deploy to production. Production deploy has the exact same steps as staging, but targets production (branch and namespace prod) instead of staging.