2022-01-27, pod limit reached

Summary

A bug in the GKE resource quota was preventing the prod hub from creating new pods: Kubernetes reported that we had exceeded our pod quota even though we certainly had not. When we deleted the gke-resource-quotas resourcequota in k8s, the pod limit error no longer appeared and things went back to normal.

The outage lasted approximately eleven and a half hours, from when the problem started until we intervened and normal operation was restored.

Timeline

All times in CET

2022-01-27 10:00 - Problem starts

mybinder.org stopped successfully launching any new pods.

21:00 - Team alerted

A user reported a Binder outage in the Matrix channel. A team member noticed the report and began investigating; a quick check showed that pods had not been launching successfully for several hours. The rest of the team was alerted via the Matrix channel.

21:06 - Error found in the logs

We discovered the following error in the logs about hitting a pod quota limit:

 HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"jupyter-leosamu-2dpythonmoocproblems-2drkut24bk\" is forbidden: exceeded quota: gke-resource-quotas, requested: pods=1, used: pods=15k, limited: pods=15k","reason":"Forbidden","details":{"name":"jupyter-leosamu-2dpythonmoocproblems-2drkut24bk","kind":"pods"},"code":403}

In particular:

exceeded quota: gke-resource-quotas, requested: pods=1, used: pods=15k, limited: pods=15k

This was confusing because we were definitely not running anywhere near 15k pods.
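
The quota's reported usage can be inspected directly to confirm the mismatch. A minimal sketch, assuming kubectl access to the cluster and that the hub runs in a namespace named prod (the actual namespace may differ):

  # Show the GKE-managed quota object named in the error above,
  # including its hard limit and the usage it currently believes
  kubectl describe resourcequota gke-resource-quotas --namespace prod

In this incident, the used: pods value it reported bore no relation to the real pod count.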

21:22 - Found a StackOverflow answer

Some searching turned up others who had run into similar problems. These two StackOverflow posts were helpful:

  • https://stackoverflow.com/questions/58716138/

  • https://stackoverflow.com/a/61656760/1927102

They pointed to a bug in the GKE-managed gke-resource-quotas Kubernetes object, and reported that deleting it causes the object to be recreated and work correctly again.

21:24 - Delete gke-resource-quotas and issue goes away

We deleted the gke-resource-quotas object, and our deployment was immediately able to create pods again; launch success returned to 100% shortly afterwards.
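
For reference, the fix amounted to a single deletion, after which GKE recreated the object on its own. A minimal sketch, again assuming a namespace named prod:

  # Delete the buggy quota object; GKE recreates it automatically
  kubectl delete resourcequota gke-resource-quotas --namespace prod

  # Once recreated, confirm the quota reports sensible usage again
  kubectl get resourcequota gke-resource-quotas --namespace prod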

Lessons learnt

What went well

  1. Deleting the resource quota object caused the system to correct itself very quickly.

What went wrong

  1. It was 11 hours before we realized that there was a major outage on Binder.

Where we got lucky

1. The actual fix was relatively simple once we knew which object to delete.

  2. A team member with the skills and permissions to make the change happened to be at their computer at 11pm their time.

Action items

Process improvements

  1. Uptime and alerting issue: https://github.com/jupyterhub/mybinder.org-deploy/issues/611