mybinder.org currently runs on Google Cloud.
This document lists the various cloud products we use, and how we use them.
We use only commodity cloud products - things that can be easily
replicated in other clouds and bare-metal hardware. This gives us
several technical and social advantages:
We avoid vendor lock-in, and can easily migrate to another provider
if we need to, for any reason.
It makes our infrastructure easily reproducible by others,
who might have different resources available to them. This is
much harder if we have a hard dependency on any single cloud provider’s products.
We can more easily contribute back to the open-source community.
Most such commodity products are open source, or have binary-
compatible open source implementations available. This allows us
to file and fix bugs in other Open Source Software for the benefit
of everyone, rather than just a particular cloud provider’s implementation.
Local testing is usually very difficult when a core component depends on
a cloud provider’s product. Constraining ourselves to commodity
products makes this easier.
As an example, using PostgreSQL via Google Cloud SQL
would be fine, since anyone can run PostgreSQL. But something like
Google Cloud Spanner or Google Cloud PubSub should be
avoided, since these cannot be run without also being on Google Cloud.
Similarly, using Google Cloud LoadBalancing
is perfectly fine, since many open source solutions (HAProxy, Envoy, nginx, etc.)
can provide the same service.
We have a project
that runs on Google Cloud: binderhub. It contains the following two clusters:
prod runs production - mybinder.org and all resources
needed for it.
staging runs staging - staging.mybinder.org and all resources
needed for it.
We try to keep staging and prod as similar as possible. Staging should
be smaller and use fewer resources. Everything we describe below
is present in staging too.
The open source Kubernetes project is used to run
all our code. Google Kubernetes Engine (GKE)
is the Google-hosted version of Kubernetes. It is very close to what is shipped
as Open Source, and does not have much in the way of proprietary enhancements.
In production, the cluster is called prod. In staging, it is called staging.
GKE has the concept of a NodePool
that specifies the kind of machine (RAM, CPU, Storage) we want to use for our Kubernetes
cluster. If we want to change the kind of machines we use, we can create a new NodePool,
cordon the current one, wait for all pods on its nodes to exit, and then delete the
old one.
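The node pool replacement described above could be scripted roughly as follows. This is a sketch, not our actual tooling: the function name and its arguments are made up for illustration, and it only builds the command lines rather than running them. It assumes `gcloud` and `kubectl` are already pointed at the right project, zone and cluster.

```python
import shlex

# GKE labels every node with the name of the node pool it belongs to.
POOL_LABEL = "cloud.google.com/gke-nodepool"


def migration_commands(cluster, old_pool, new_pool, machine_type):
    """Return the shell commands for replacing old_pool with new_pool, in order."""
    selector = f"{POOL_LABEL}={old_pool}"
    steps = [
        # 1. Create the new pool with the desired machine type.
        ["gcloud", "container", "node-pools", "create", new_pool,
         f"--cluster={cluster}", f"--machine-type={machine_type}"],
        # 2. Cordon every node in the old pool so no new pods land there.
        ["kubectl", "cordon", "--selector", selector],
        # 3. Re-run this until no user pods are left on the old nodes.
        ["kubectl", "get", "pods", "--all-namespaces", "--output=wide"],
        # 4. Delete the old pool once it is empty.
        ["gcloud", "container", "node-pools", "delete", old_pool,
         f"--cluster={cluster}"],
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    # Pool names here are placeholders, not the real pool names.
    for command in migration_commands("prod", "old-pool", "new-pool", "n1-highmem-32"):
        print(command)
```

Printing the commands instead of executing them keeps the waiting step manual, which matters here because user pods must be allowed to finish on their own.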
The prod cluster currently uses n1-highmem-32 machines. These have
32 CPU cores and 208 GB of memory. We use the highmem machines (with more memory per CPU)
as opposed to standard machines for the following reasons:
Memory is an incompressible resource - once you give a process memory, you
cannot take it away without killing the process. CPU is compressible - you can
reduce how much CPU a process gets without killing it.
Our users generally seem to be running not-very-cpu-intensive code, as can be
witnessed from our generally low CPU usage.
Docker layer caching gives us massive performance boosts - less time spent
pulling images leads to faster startup times for users. Using larger nodes
increases the cache hit rate, so we use nodes with more rather than less RAM.
tl;dr: Using highmem machines saves us a lot of money, since we are not paying for CPU
we are not using!
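To make the memory-versus-CPU trade-off concrete, here is a small back-of-the-envelope sketch. The machine shapes are the published ones for the two 32-core N1 types; the workload (1000 concurrent sessions at 2 GB each) is a made-up number for illustration, and the calculation ignores memory reserved for the system.

```python
import math

# Shapes of the two 32-core N1 machine types.
MACHINES = {
    "n1-standard-32": {"cpus": 32, "memory_gb": 120},
    "n1-highmem-32": {"cpus": 32, "memory_gb": 208},
}


def nodes_needed(machine, sessions, memory_per_session_gb):
    """Nodes required when memory, not CPU, is the limiting resource."""
    sessions_per_node = MACHINES[machine]["memory_gb"] // memory_per_session_gb
    return math.ceil(sessions / sessions_per_node)


if __name__ == "__main__":
    for name, shape in MACHINES.items():
        count = nodes_needed(name, sessions=1000, memory_per_session_gb=2)
        print(name, "nodes:", count, "CPUs paid for:", count * shape["cpus"])
```

Under these assumptions the standard machines need 17 nodes (544 cores) while the highmem machines need 10 nodes (320 cores) to hold the same sessions.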
The staging cluster uses much smaller machines than the production one, to keep costs down.
In prod, we use 1000 GB SSD disks as boot disks. On Google Cloud, the size of
the disk controls the performance - the larger the disk, the faster it is. Our disks need to be fast since we
are doing a lot of I/O operations during docker build / push / pull / run, so we use large disks.
Note that SSD boot disks are not a feature available on GKE to all customers -
we have been given early access
to this feature, since it makes a dramatic difference to our performance (and
we knew where to ask!).
Staging does not use SSD boot disks.
We use the GKE Cluster Autoscaler
feature to add more nodes when we run out of resources. When the cluster is 100%
full, the cluster autoscaler adds a new node to handle more pods. However,
there is no way to make the autoscaler kick in at 80% or 90% utilization
(bug), so this leads
to launch failures
for a short time while a new node comes up.
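The effect of scaling up only at 100% utilization can be illustrated with a toy simulation. This is a deliberately simplified model, not how the cluster autoscaler is implemented: time is counted in launch attempts, every node holds the same number of pods, failed launches are not retried, and the `scale_up_percent` knob is hypothetical (it is exactly the setting we do not have).

```python
def simulate(total_launches, pods_per_node, scale_up_percent=100, startup_delay=3):
    """Count launch failures while users try to launch pods one at a time.

    A new node is requested once usage reaches scale_up_percent of capacity
    and becomes usable startup_delay launch attempts later.
    """
    nodes, running, failures = 1, 0, 0
    node_ready_at = None  # attempt number at which the requested node is usable
    for attempt in range(total_launches):
        if node_ready_at is not None and attempt >= node_ready_at:
            nodes += 1
            node_ready_at = None
        capacity = nodes * pods_per_node
        if running < capacity:
            running += 1
        else:
            failures += 1  # cluster is full and the new node is not up yet
        # Integer arithmetic avoids floating point surprises in the comparison.
        if node_ready_at is None and running * 100 >= scale_up_percent * capacity:
            node_ready_at = attempt + startup_delay
    return failures


if __name__ == "__main__":
    print("scale up at 100%:", simulate(30, 10, scale_up_percent=100))
    print("scale up at 80%: ", simulate(30, 10, scale_up_percent=80))
```

With these made-up numbers, scaling at 100% loses 4 launches while nodes start, and scaling at 80% loses none, because the new node is ready before the old ones fill up.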
The autoscaler can be set to have a minimum number of nodes and a maximum number of nodes.
A core part of mybinder.org is building, storing and then running docker images
(built by repo2docker). Docker images
are generally stored in a docker registry,
using a well-defined standard API.
We use Google Cloud’s hosted docker registry - Google Container Registry (GCR).
This lets us use a standard mechanism for storing and retrieving docker images
without having to run any of that infrastructure ourselves.
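As an illustration of that standard API, the Docker Registry HTTP API V2 exposes each image manifest at a predictable URL, so asking whether an image exists comes down to one request. The sketch below only builds the URL and interprets the status code. Authentication and the HTTP call itself are left out, and the image name is a placeholder.

```python
def manifest_url(registry_host, image_name, tag):
    """URL of an image manifest in the Docker Registry HTTP API V2."""
    return f"https://{registry_host}/v2/{image_name}/manifests/{tag}"


def image_exists(status_code):
    """Interpret the status of a HEAD or GET request against manifest_url()."""
    if status_code == 200:
        return True
    if status_code == 404:
        return False
    # Anything else (for example 401 without credentials) is not an answer.
    raise RuntimeError(f"unexpected registry response: {status_code}")


if __name__ == "__main__":
    # "example-image" and the tag are placeholders, not a real image.
    print(manifest_url("gcr.io", "binder-prod/example-image", "abc123"))
```

Because the same URL scheme works against any registry that implements this API, nothing in it ties us to GCR.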
GCR is private by default, and can only be used from inside the Google Cloud project
the registry is located in. When using GKE, the authentication for pulling images
to run is already set up for us, so we do not need to do anything special. For pushing
images, we authenticate via a service account.
You can find this service account credential under registry in secrets/config/prod.yaml.
The images are scoped per-project: the images made by mybinder.org are
stored in the binder-prod project, and the images made by staging.mybinder.org
are stored in the binder-staging project.
We do not allow users to pull our images, for a few reasons:
We pay network egress costs when images are used outside the project they are in,
and this can become very costly!
This can be abused to treat us as a content redistributor - build
an image with content you want, and then just pull the image from elsewhere. This
makes us a convenient possible hop in cybercrime / piracy / other operations,
complicates possible DMCA
/ GDPR compliance, and
probably causes a bunch of other bad things we do not have the imagination to foresee.
We might decide to clean up old images when we no longer need them, and this might
break other users who might depend on this.
For users who want access to a docker image similar to how it is built with Binder,
we recommend using repo2docker to build
your own, and push it to a registry of your choice.
Since building an image takes a long time, we would like to re-use images as much
as possible. If we have built an image once for a particular repository at a particular
commit, we do not want to rebuild it - we just re-use the existing image.
We generate an image name for each image we build that is uniquely derived from
the name of the repository + the commit hash. This lets us check if
the image has already been built, and if so, we can skip the building step.
The code for generating the image name from repository information is
in binderhub’s builder.py.
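The idea can be sketched as follows. This is not the code from builder.py: the function names, the slug rules and the digest suffix are assumptions chosen for illustration. The point is only that the same repository and commit always map to the same name, so an existing image can be detected before building.

```python
import hashlib
import re


def image_name(prefix, repo_url, commit_hash):
    """Return a deterministic image reference for a repository at a commit."""
    # Registry names allow only lowercase letters, digits and a few
    # separators, so squash every other run of characters into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", repo_url.lower()).strip("-")
    # Different URLs can squash to the same slug, so add a short digest of
    # the original URL to keep the name unique per repository.
    digest = hashlib.sha1(repo_url.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}{slug}-{digest}:{commit_hash}"


def needs_build(name, existing_images):
    """Only build when no image with this exact name has been stored."""
    return name not in existing_images


if __name__ == "__main__":
    # The prefix, repository URL and commit are placeholders.
    name = image_name("example-prefix-", "https://github.com/example/repo", "abc123")
    print(name, "build needed:", needs_build(name, existing_images=set()))
```

In practice the set of existing images would be replaced by a lookup against the registry.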
Sometimes, we do want to invalidate all previously built images - for example,
when we do a major notebook version bump. This will cause all repositories to be
rebuilt the next time they are launched. There is a performance cost to this, so
this invalidation has to be done judiciously. This is done by giving all the images
a prefix (binderhub.registry.prefix in config/prod.yaml and config/staging.yaml).
Changing this prefix will invalidate all existing images.
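Why a prefix change invalidates everything follows directly from the naming scheme: the prefix is part of every image name, so no new name matches an old image. A minimal sketch, with made-up prefixes and a simplified naming function standing in for the real one:

```python
def cache_key(prefix, repo_slug, commit_hash):
    """Simplified stand-in for the real image naming function."""
    return f"{prefix}{repo_slug}:{commit_hash}"


def is_cached(prefix, repo_slug, commit_hash, built_images):
    """True when an image for this repo and commit exists under this prefix."""
    return cache_key(prefix, repo_slug, commit_hash) in built_images


if __name__ == "__main__":
    # Images built while the (hypothetical) prefix was "old-prefix-".
    built = {cache_key("old-prefix-", "example-repo", "abc123")}
    print(is_cached("old-prefix-", "example-repo", "abc123", built))  # still a hit
    print(is_cached("new-prefix-", "example-repo", "abc123", built))  # forces a rebuild
```

The old images are not deleted by this; they simply stop being looked up, so cleaning them out of the registry is a separate step.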
We use Google Stackdriver for logging
activity on the Kubernetes deployment. This is useful for listing the raw
logs coming out of BinderHub, though we don’t use it for dashboarding (see below).
We use Prometheus for collecting more fine-grained metrics about
what’s happening on the deployment, and Grafana for generating
dashboards using the data from Prometheus.
We use Google Analytics to keep
track of activity on the mybinder.org site, though note that we lose track
of users as soon as they are directed to their Binder session.