Incident reporting
This page contains information and guidelines for how the Binder team handles incidents and incident reports. Remember, incidents are opportunities to learn!
Principles and guidelines for incident reporting
Inspiration for our guidelines: Google SRE guide, Managing Incidents.
Team management and takeaways from incidents: Etsy Debriefing Facilitation Guide.
Example template for incident report
Incident history
(in reverse chronological order)
- Template for reports
- {{ incident date: yyyy-mm-dd }}, {{ incident name }}
- 2022-01-27, pod limit reached
- 2022-01-27, stale prime version
- 2020-07-09, Simultaneous launches (aka, SciPy gives Binder a lot of hugs at the same time)
- 2019-04-03, 30min outage during node pool upgrade
- 2019-03-24, repo2docker upgrade and docker image cache wipe
- 2019-02-20, kubectl logs unavailable
- 2018-07-30, JupyterLab builds saturate BinderHub CPU
- 2018-07-08, too many pods
- 2018-04-18, Culler flood
- 2018-03-31, Server launch failures
- 2018-03-26, “no space left on device”
- 2018-03-13, PVC for hub is locked
- 2018-02-22, NGINX crash
- 2018-02-20, JupyterLab Announcement swamps Binder
- 2018-02-12, Hub Launch Fail
- 2018-01-18, reddit hugs mybinder
- 2018-01-17, Emergency Aardvark bump
- 2018-01-11, Warning from letsencrypt about outdated SSL certificate
- 2018-01-04, Failed deploy to staging
- 2017-11-30 4:23PM PST, OOM (Out of Memory) Proxy
- 2017-10-17, Cluster Full
- 2017-09-29, 504
- 2017-09-27, Hub 403