Incident date: 2019-02-20, kubectl logs unavailable

Summary

For new builds, users did not see the build logs, although the builds themselves ran normally. Fixed by removing a node that had no IP address in its metadata.

Timeline

All times in Central European Time (CET)

2019-02-20 08:03

kubectl logs -f <pod> has stopped working, reporting:

Error from server: Get https://10.128.0.6:10250/containerLogs/prod/jupyter-.../notebook?follow=true: No SSH tunnels currently open. Were the targets able to accept an ssh-key for user "gke-df315ed616f010764e03"?

This occurs for all pods. Failure to retrieve pod logs means that build logs are not being forwarded to users.

This occurred once previously on 2019-01-08 and lasted 24 hours. The first occurrence resolved itself without ever being diagnosed.

2019-02-20 10:12

A deploy of a minor version bump fails due to the same inability to create ssh tunnels.

2019-02-20 10:20

A planned upgrade of the cluster begins, in the hope that upgrading the master and bringing in new nodes will fix the issue.

2019-02-20 11:17

New nodes are fully up and running (old nodes are still present and draining), and kubectl logs still fails with:

Error from server: Get https://10.128.0.17:10250/...?follow=true: No SSH tunnels currently open. Were the targets able to accept an ssh-key for user "gke-df315ed616f010764e03"?

At this point, we know that upgrading the master and the nodes will not fix kubectl logs, at least for new nodes. We do not yet know whether removing the old nodes will fix the issue (spoiler: it will). Investigation begins in earnest.

Troubleshooting documentation suggests that something could have gone wrong with the master’s ssh access to the nodes.

This is verified not to be the case by checking:

  1. ssh-key metadata is not too full

  2. ssh key for gke-df315ed616f010764e03 is present on all nodes in /home/gke-df315ed616f010764e03/.ssh/authorized_keys

  3. firewall rules allow ssh from the master, which is doubly verified by attempting to ssh to gke-df315ed616f010764e03@a.b.c.d for one of the nodes from outside the cluster

  4. journalctl | grep df315 on the nodes confirms that our debugging ssh attempts were rejected at the public-key stage, and that no such attempt was ever made from the master

At this point we know that there is nothing on the nodes preventing the master from creating an ssh tunnel. Something must be wrong on the master itself, and worryingly, something that is not fixed by upgrading the master.
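For reference, a sketch of how checks 1-4 above can be run; the gcloud invocations, key paths, and the a.b.c.d address are illustrative rather than a record of the exact commands used:

# 1. size of the ssh-keys project metadata (it has a hard size limit)
gcloud compute project-info describe --format=json \
    | jq '.commonInstanceMetadata.items[] | select(.key == "ssh-keys") | .value | length'

# 2. on each node: the GKE user's key is present
grep -c gke-df315ed616f010764e03 /home/gke-df315ed616f010764e03/.ssh/authorized_keys

# 3. firewall rules allow ssh, doubly verified by connecting directly from outside the cluster
gcloud compute firewall-rules list | grep -w 22
ssh gke-df315ed616f010764e03@a.b.c.d

# 4. on each node: our test connections show up in the logs, nothing from the master does
sudo journalctl | grep df315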

2019-02-20 12:00

After some Googling, we found how to get the API server logs:

kubectl proxy &
curl -O http://localhost:8001/logs/kube-apiserver.log
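The same file can also be fetched without the proxy, by asking kubectl for the raw path (assuming the apiserver's /logs handler is enabled, as it evidently is here):

kubectl get --raw /logs/kube-apiserver.log > kube-apiserver.log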

Searching the API server logs for clues relating to ssh:

grep ssh kube-apiserver.log

revealed only one hint: a repeated occurrence of

E0220 10:59:57.797466       1 ssh.go:200] Failed to getAddresses: no preferred addresses found; known addresses: [{Hostname gke-prod-a-users-b921bb88-rd5d}]

along with numerous repeats of the much less informative error:

logging error output: "Error: 'No SSH tunnels currently open. Were the targets able to accept an ssh-key for user \"gke-df315ed616f010764e03\"?

This gives us hope that the issue may go away when the listed node gke-prod-a-users-b921bb88-rd5d is removed. The node has been cordoned and will be removed from the pool when it is empty.
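For reference, a sketch of the cordon-and-remove sequence; the instance group and zone are placeholders, and the exact removal step depends on how the node pool is managed:

kubectl cordon gke-prod-a-users-b921bb88-rd5d
kubectl drain gke-prod-a-users-b921bb88-rd5d --ignore-daemonsets --delete-local-data
# once the node is empty, delete the underlying VM from its managed instance group
gcloud compute instance-groups managed delete-instances <instance-group> \
    --instances=gke-prod-a-users-b921bb88-rd5d --zone=<zone>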

2019-02-20 13:30

With rd5d drained and removed from the pool, kubectl logs is restored and everything is working again.

Lessons learnt

What went well

  • Even while kubectl logs did not work and the whole cluster was being upgraded, the launch success rate never dropped below 97.5%, and only very briefly dipped below 100%. Great job!

  • Once the problematic node was removed, everything recovered.

What went wrong

Things that could have gone better. Ideally these should result in concrete action items with GitHub issues created for them and linked under Action items.

  • It was very difficult to identify the root cause of the issue, in part because it appears to be a bug in the kubernetes master and/or the GKE service itself.

  • It would have been useful to capture a kubectl describe node gke-prod-a-users-b921bb88-rd5d once we became suspicious of the node, to use as a reference for comparing the unhealthy node to healthy ones, especially now that we know that this node was indeed the cause (see the sketch below).
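    A minimal sketch of that comparison, using the unhealthy node and one of the healthy nodes named later in this report:

    kubectl describe node gke-prod-a-users-b921bb88-rd5d > unhealthy.txt
    kubectl describe node gke-prod-a-user-be5bcf07-xhxd > healthy.txt
    diff unhealthy.txt healthy.txt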

Where we got lucky

  • While the event was occurring, the following command was run:

    kubectl get node -o json | jq '.items[] | [.status.addresses[].address]'
    

    in order to retrieve the IP addresses of nodes for testing direct ssh access. Unusually, the problematic node reported no known ExternalIP or InternalIP. A normal entry looks like:

    ["10.128.xxx.yyy", "35.202.xxx.yyy", "gke-prod-a-user-be5bcf07-xhxd"]
    

    while the unhealthy node reported

    ["", "gke-prod-a-users-b921bb88-rd5d"]
    

    indicating that kubernetes had lost track of the node’s ExternalIP and InternalIP. This was likely a direct symptom of the problem and could be used as a diagnostic in the future.
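    As a future diagnostic, the same data can be filtered down to nodes that are missing an InternalIP or ExternalIP; the jq expression below is a sketch, not something that was run during the incident:

    kubectl get node -o json | jq -r '.items[]
        | select(([.status.addresses[] | select(.type == "InternalIP" or .type == "ExternalIP")] | length) < 2)
        | .metadata.name'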

Action items

Process improvements

  1. Test and notify when kubectl logs doesn’t work (issue)
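     A minimal sketch of such a check; the prod namespace is taken from the error above, while the pod selection and alerting mechanism are assumptions (the script simply exits non-zero so a cron job or monitoring wrapper can notify):

     pod=$(kubectl get pods -n prod -o jsonpath='{.items[0].metadata.name}')
     kubectl logs -n prod --tail=1 "$pod" > /dev/null \
         || { echo "kubectl logs failed for $pod" >&2; exit 1; }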

Documentation improvements

  1. Add the command to retrieve apiserver logs to the SRE guide (issue)

  2. Describe how to identify suspicious networking status in the SRE guide (issue)