2018-03-31, Server launch failures#

Summary#

After a few days of generally sub-optimal stability and some strange networking errors, a node was deleted. This triggered a broader outage that was only resolved by recycling all of the cluster's nodes.

link to Gitter incident start

Timeline#

All times in PST

2018-03-31 15:47#

Problem is identified

  • Launch success rate drops quickly

  • Many pods are stuck in “Terminating” and “ContainerCreating” states.

  • The hub pod shows many timeout errors (some checks that could surface this are sketched below).
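
A minimal sketch of the checks that could surface these symptoms, assuming the deployment lives in a prod namespace with a hub Deployment (both names are assumptions here):

  # Pods stuck in a bad state (namespace name is an assumption)
  kubectl get pods --namespace prod | grep -E 'Terminating|ContainerCreating'

  # Launch timeouts in the hub logs (deployment name is an assumption)
  kubectl logs --namespace prod deploy/hub --tail=200 | grep -i timeout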

16:11#

  • Mount errors on build pods:

Events:
  Type     Reason                 Age                From                                           Message
  ----     ------                 ----               ----                                           -------
  Normal   Scheduled              5m                 default-scheduler                              Successfully assigned build-devvyn-2daafc-2dfield-2ddata-6e8479-c1cecc to gke-prod-a-ssd-pool-32-134a959a-p2kz
  Normal   SuccessfulMountVolume  5m                 kubelet, gke-prod-a-ssd-pool-32-134a959a-p2kz  MountVolume.SetUp succeeded for volume "docker-socket"
  Warning  FailedMount            4m (x8 over 5m)    kubelet, gke-prod-a-ssd-pool-32-134a959a-p2kz  MountVolume.SetUp failed for volume "docker-push-secret" : mkdir /var/lib/kubelet/pods/e8d31d4e-3537-11e8-88bf-42010a800059: read-only file system
  Warning  FailedMount            4m (x8 over 5m)    kubelet, gke-prod-a-ssd-pool-32-134a959a-p2kz  MountVolume.SetUp failed for volume "default-token-ftskg" : mkdir /var/lib/kubelet/pods/e8d31d4e-3537-11e8-88bf-42010a800059: read-only file system
  Warning  FailedSync             55s (x24 over 5m)  kubelet, gke-prod-a-ssd-pool-32-134a959a-p2kz  Error syncing pod
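
The “mkdir … read-only file system” messages suggest the kubelet could no longer write under /var/lib/kubelet, i.e. the node’s filesystem had been remounted read-only. A hedged sketch of how this could have been confirmed (node name taken from the events above; the zone is an assumption):

  # Node conditions reported by the kubelet
  kubectl describe node gke-prod-a-ssd-pool-32-134a959a-p2kz

  # Look for filesystem errors on the node itself (zone is an assumption)
  gcloud compute ssh gke-prod-a-ssd-pool-32-134a959a-p2kz --zone us-central1-a \
    --command "dmesg | grep -i 'read-only'"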

17:21#

  • Decided to increase the cluster size to 4 nodes, wait for the new nodes to come up, then cordon the two older nodes (sketched below)
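
A minimal sketch of that procedure, assuming a GKE cluster named prod-a with node pool ssd-pool-32 (names inferred from the node names in the events above; the zone and old node names are placeholders):

  # Grow the node pool so replacement capacity comes up first
  gcloud container clusters resize prod-a --node-pool ssd-pool-32 \
    --num-nodes 4 --zone us-central1-a

  # Once the new nodes are Ready, stop scheduling onto the old ones
  kubectl cordon <old-node-1> <old-node-2>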

17:30#

  • New nodes are up, old nodes are drained (see the sketch after this list)

  • Hub / binder pods show up on new nodes

  • Launch success rate begins increasing

  • Launch success rate returns to 100%
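
A hedged sketch of the drain and the follow-up check, with placeholder node names and an assumed prod namespace:

  # Evict remaining pods from the cordoned nodes
  kubectl drain <old-node-1> --ignore-daemonsets --delete-local-data --force
  kubectl drain <old-node-2> --ignore-daemonsets --delete-local-data --force

  # Confirm the hub and binder pods landed on the new nodes
  kubectl get pods --namespace prod -o wide | grep -E 'hub|binder'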

Lessons learnt#

What went well#

  1. The problem was eventually resolved

What went wrong#

  1. It was difficult to debug this problem: there was no obvious error message, and the person responding wasn’t sure how to debug it.

Action items#

Investigation#

The outage seemed to be triggered by the deletion of a node, but pre-existing nodes appeared to be affected as well. Perhaps this is a general problem that occurs when nodes become too old?

What went wrong#

  • There was a major outage that we were unable to debug; there were no clear errors in the logs.

  • There was only one person available to debug, which made it more difficult to know how to proceed without any feedback.

Process improvements#

  1. Improve the alerting so that a majority of the team is notified when there’s an outage. (currently blocked on #365)

  2. Come up with team guidelines for how “stale” a node can become before we intentionally recycle it. (#528)

Documentation improvements#

  1. Document how to “recycle” nodes properly (#528)