# 2022-01-27, pod limit reached

## Summary

A bug in the GKE resource quota prevented the `prod` hub from creating new pods. The Kubernetes API reported that we had exceeded our pod quota even though we certainly had not. When we deleted the `gke-resource-quotas` `resourcequota` object in Kubernetes, the pod limit error no longer appeared and things went back to normal. The outage lasted approximately eleven and a half hours, from 10:00 until the quota object was deleted at 21:24.

## Timeline

All times in CET

### 2022-01-27 10:00 - Problem starts

mybinder.org stopped successfully launching any new pods.

### 21:00 - Team alerted

A [user reported a Binder outage in the Matrix channel](https://matrix.to/#/!FUpHWAzqkjcOgkhmHS:petrichor.me/$yvbn4wMMghzF0COGPEVckE_1i7535UL5NLRc6Xlpu34?via=petrichor.me&via=gitter.im&via=matrix.org). A team member noticed the report, and a quick investigation showed that pods hadn't been launching for several hours. The rest of the team was alerted via the Matrix channel.

### 21:06 - Logger error

We discovered the following log error about hitting a pod quota limit:

```
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"jupyter-leosamu-2dpythonmoocproblems-2drkut24bk\" is forbidden: exceeded quota: gke-resource-quotas, requested: pods=1, used: pods=15k, limited: pods=15k","reason":"Forbidden","details":{"name":"jupyter-leosamu-2dpythonmoocproblems-2drkut24bk","kind":"pods"},"code":403}
```

In particular:

```
exceeded quota: gke-resource-quotas, requested: pods=1, used: pods=15k, limited: pods=15k"
```

This is confusing because we definitely are not using 15k pods.

### 21:22 - Found a StackOverflow answer

Some investigating on StackOverflow showed that others have run into similar problems.
These two StackOverflow posts were helpful:

- https://stackoverflow.com/questions/58716138/
- https://stackoverflow.com/a/61656760/1927102

They mentioned that it was a bug in the `gke-resource-quotas` Kubernetes object, and that deleting it caused the object to be recreated and work correctly again.

### 21:24 - Delete `gke-resource-quotas` and issue goes away

We deleted the `gke-resource-quotas` object, and our deployment was immediately able to create pods again. Launch success went back to 100% pretty quickly.

## Lessons learnt

### What went well

1. Deleting the resource quota object caused the system to correct itself very quickly.

### What went wrong

1. It was 11 hours before we realized that there was a major outage on Binder.

### Where we got lucky

1. The actual fix was relatively simple once we knew to delete the right object.
1. A team member with the skills and permissions to make the change happened to be at their computer at 11pm their time.

## Action items

### Process improvements

1. Uptime and alerting issue: https://github.com/jupyterhub/mybinder.org-deploy/issues/611
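
As a starting point for the alerting item, here is a minimal sketch of how the quota error from the timeline could be detected automatically. It parses the `exceeded quota` message and flags the case where the quota claims to be full while the number of pods actually observed is far lower. The function names, the `tolerance` parameter, and the idea of passing in an observed pod count are assumptions made for illustration; none of this is existing mybinder.org tooling.

```python
import re

# Matches the quota portion of the "Forbidden" message shown in the timeline.
QUOTA_RE = re.compile(
    r"exceeded quota: (?P<quota>[\w.-]+), "
    r"requested: pods=(?P<requested>[\w.]+), "
    r"used: pods=(?P<used>[\w.]+), "
    r"limited: pods=(?P<limited>[\w.]+)"
)


def parse_count(text):
    """Convert a count such as '1' or '15k' to an int ('k' means thousands)."""
    if text.endswith("k"):
        return int(float(text[:-1]) * 1000)
    return int(text)


def parse_quota_error(message):
    """Return the quota name and pod counts from an error message, or None."""
    match = QUOTA_RE.search(message)
    if match is None:
        return None
    return {
        "quota": match["quota"],
        "requested": parse_count(match["requested"]),
        "used": parse_count(match["used"]),
        "limited": parse_count(match["limited"]),
    }


def quota_looks_stale(message, observed_pods, tolerance=0.5):
    """True if the quota reports itself full but far fewer pods are observed.

    `observed_pods` is a hypothetical input: a pod count obtained separately,
    for example by listing pods. `tolerance` is an illustrative threshold.
    """
    info = parse_quota_error(message)
    if info is None:
        return False
    quota_full = info["used"] >= info["limited"]
    return quota_full and observed_pods < tolerance * info["limited"]
```

Given the message from the 21:06 entry and a much smaller observed pod count, `quota_looks_stale` returns `True`, which could then feed whatever alerting channel the team settles on.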