2018-03-31, Server launch failures#
After a few days of generally poor stability and some strange networking errors, a node was deleted. This triggered a broader outage that was only resolved by fully recycling all nodes.
All times in PST
Problem is identified
Launch success rate drops quickly
Many pods stuck in “Terminating” and “ContainerCreating” states.
Hub pod is showing many timeout errors.
Mount errors on build pods:
```
Events:
  Type     Reason                 Age                From                                           Message
  ----     ------                 ----               ----                                           -------
  Normal   Scheduled              5m                 default-scheduler                              Successfully assigned build-devvyn-2daafc-2dfield-2ddata-6e8479-c1cecc to gke-prod-a-ssd-pool-32-134a959a-p2kz
  Normal   SuccessfulMountVolume  5m                 kubelet, gke-prod-a-ssd-pool-32-134a959a-p2kz  MountVolume.SetUp succeeded for volume "docker-socket"
  Warning  FailedMount            4m (x8 over 5m)    kubelet, gke-prod-a-ssd-pool-32-134a959a-p2kz  MountVolume.SetUp failed for volume "docker-push-secret" : mkdir /var/lib/kubelet/pods/e8d31d4e-3537-11e8-88bf-42010a800059: read-only file system
  Warning  FailedMount            4m (x8 over 5m)    kubelet, gke-prod-a-ssd-pool-32-134a959a-p2kz  MountVolume.SetUp failed for volume "default-token-ftskg" : mkdir /var/lib/kubelet/pods/e8d31d4e-3537-11e8-88bf-42010a800059: read-only file system
  Warning  FailedSync             55s (x24 over 5m)  kubelet, gke-prod-a-ssd-pool-32-134a959a-p2kz  Error syncing pod
```
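Events like these come from `kubectl describe`. A quick way to find affected pods and surface such errors (a sketch, assuming `kubectl` access to the cluster; the pod name is the one from the events above) is:

```shell
# List pods stuck in Terminating / ContainerCreating across all namespaces.
kubectl get pods --all-namespaces | grep -E 'Terminating|ContainerCreating'

# Show recent events for a suspect pod.
kubectl describe pod build-devvyn-2daafc-2dfield-2ddata-6e8479-c1cecc

# Or scan cluster events in time order to spot FailedMount errors.
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -i failedmount
```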
Decided to increase the cluster size to 4 nodes, wait for the new nodes to come up, then cordon the two older nodes
New nodes are up, old nodes are drained
Hub / binder pods show up on new nodes
Launch success rate begins increasing
Launch rate goes back to 100%
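For reference, the resize/cordon/drain procedure from the timeline can be sketched with standard `gcloud`/`kubectl` commands. The cluster, pool, and node names below are inferred from the events output above and are assumptions, not the exact commands that were run:

```shell
# Sketch of the node-recycle procedure, assuming the GKE cluster/pool names
# visible in the events above (prod-a / ssd-pool-32); adjust to your cluster.

# 1. Add spare capacity so evicted pods have somewhere to land.
gcloud container clusters resize prod-a --node-pool ssd-pool-32 --num-nodes 4

# 2. Cordon the old nodes so no new pods are scheduled onto them.
kubectl cordon gke-prod-a-ssd-pool-32-134a959a-p2kz

# 3. Drain the old nodes, evicting their pods onto the new ones.
kubectl drain gke-prod-a-ssd-pool-32-134a959a-p2kz \
  --ignore-daemonsets --delete-local-data

# 4. Once drained, the old nodes can be removed (e.g. by shrinking the pool).
```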
What went well#
The problem was eventually resolved
What went wrong#
It was difficult to debug this problem: there was no obvious error message, and the person responding wasn’t sure how to debug it.
The outage seemed to be triggered by the deletion of a node, but other pre-existing nodes appeared to be affected as well. Perhaps this is a general failure mode when nodes become too old?
There were no clear errors in the logs, which left us unable to pinpoint the root cause of the outage.
Only one person was available to debug, which made it more difficult to know how to proceed without any feedback.
Action items#
Document how to “recycle” nodes properly (#528)