Upgrading Google Kubernetes Engine

At DoiT International we work with customers small and large, and from time to time we recognize common issues, especially among our larger-scale customers. We feel a recent issue with upgrading Google Kubernetes Engine (GKE, Google’s managed Kubernetes) is worth sharing in case others are planning upgrades or run into similar problems.

Symptoms

Not all customers experience this, but we have witnessed the following symptoms:

After the upgrade, the kubectl CLI cannot interact with the cluster (the API-Server does not respond)
The upgrade process appears to hang and has not completed after 20+ minutes
An error notice in the console or logs citing the following or similar: “All cluster resources were brought up, but: component "kube-apiserver" from endpoint "gke-XXXXXXXX-XXXXXXX" is unhealthy.”
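
As a rough illustration, the last two symptoms can be checked mechanically. The sketch below is not an official diagnostic: the function names are hypothetical, the 20-minute threshold is simply taken from the list above, and the pattern assumes the log line matches the quoted error text apart from the endpoint name.

```python
import re
from datetime import datetime, timedelta

# Threshold taken from the symptom list: upgrade still running after 20+ minutes.
HUNG_AFTER = timedelta(minutes=20)

# Assumed shape of the error text; the endpoint name differs per cluster.
UNHEALTHY_RE = re.compile(
    r'component "kube-apiserver" from endpoint "[^"]+" is unhealthy'
)


def normalize_quotes(text: str) -> str:
    """Replace typographic double quotes with plain ones before matching."""
    return text.replace("\u201c", '"').replace("\u201d", '"')


def mentions_unhealthy_apiserver(log_line: str) -> bool:
    """Return True if the line contains the unhealthy kube-apiserver notice."""
    return UNHEALTHY_RE.search(normalize_quotes(log_line)) is not None


def upgrade_looks_hung(started_at: datetime, now: datetime, done: bool) -> bool:
    """Return True if the upgrade is unfinished 20 or more minutes after it started."""
    return not done and (now - started_at) >= HUNG_AFTER
```

Both helpers are pure functions, so they can be fed timestamps and log lines from whatever source you already collect them from.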

Risk Diagnosis

Although not everyone experiences this, the commonalities we’ve witnessed include:

GKE cluster version below 1.16
Zonal cluster (single-zone master for control plane)
“Chatty” workloads that continuously interact with API-Server like Istio, Flux, or ArgoCD
Minor-version upgrades such as 1.12 -> 1.13, 1.13 -> 1.14, 1.14 -> 1.15, 1.15 -> 1.16 (typically not seen during patch updates)
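
The checklist above can be turned into a small pre-upgrade check. This is a minimal sketch, not a tool we ship: `UpgradePlan`, `major_minor`, and `risk_factors` are hypothetical names, the version strings are illustrative, and you would fill in the fields yourself from your cluster's configuration.

```python
from dataclasses import dataclass


@dataclass
class UpgradePlan:
    current_version: str          # illustrative format, e.g. "1.14.10"
    target_version: str
    regional: bool                # False for a zonal (single-zone master) cluster
    chatty_workloads: tuple = ()  # e.g. ("istio", "flux", "argocd")


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.14.10'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def risk_factors(plan: UpgradePlan) -> list:
    """Return the risk factors from the list above that apply to this plan."""
    factors = []
    current = major_minor(plan.current_version)
    target = major_minor(plan.target_version)
    if current < (1, 16):
        factors.append("cluster version below 1.16")
    if not plan.regional:
        factors.append("zonal cluster (single-zone control plane)")
    if plan.chatty_workloads:
        factors.append("chatty workloads: " + ", ".join(plan.chatty_workloads))
    if target != current:
        factors.append("minor-version upgrade (not a patch update)")
    return factors


if __name__ == "__main__":
    plan = UpgradePlan("1.14.10", "1.15.9", regional=False,
                       chatty_workloads=("istio", "flux"))
    for factor in risk_factors(plan):
        print("risk:", factor)
```

A plan that matches several factors is not guaranteed to hit the issue; it only indicates that the upgrade deserves extra care.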

Who might be impacted?

We’d like to reiterate that this has only occurred with a few customers thus far. Most fit the profile above: zonal clusters with heavy workloads hitting the API-Server, which caused it to take too long to pass health checks during the upgrade and left the upgrade stuck in a “hung” state. We hope this information is helpful either for planning future version upgrades or for troubleshooting existing issues.

Remediation

Google support case to increase health check timeout

If you are already experiencing this issue but your worker nodes are still serving traffic (there is just no kubectl access to the control plane), you can submit a support request to Google Support or to your technical support partner (we hope it’s DoiT International) asking to increase the health check timeout to 3 minutes. This allows more time for the API-Server to recover and prevents a cycle of failed health checks.

Reduce node pool size and load on API-Server

If you can afford potential downtime of worker nodes, you can ease the pressure on your API-Server in one of two ways: disable or scale down the “chatty” workloads for a period of time, or simply scale your node pool down to 0, let the upgrade complete, and then scale it back up.
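
The two options can be sketched as command builders around `kubectl scale` and `gcloud container clusters resize`. This is only an illustration under stated assumptions: the cluster, zone, node pool, namespace, and deployment names are placeholders, the function names are hypothetical, and the script prints the commands rather than running them. Note that `kubectl scale` needs a responsive API-Server, so it is only useful before the upgrade starts, whereas resizing the node pool goes through the GKE API.

```python
import shlex


def scale_deployment_cmd(namespace: str, deployment: str, replicas: int) -> list:
    """Build a kubectl command that scales one deployment (run BEFORE upgrading)."""
    return ["kubectl", "scale", "deployment", deployment,
            "--namespace", namespace, f"--replicas={replicas}"]


def resize_node_pool_cmd(cluster: str, zone: str, node_pool: str,
                         num_nodes: int) -> list:
    """Build a gcloud command that resizes one node pool of a zonal cluster."""
    return ["gcloud", "container", "clusters", "resize", cluster,
            "--node-pool", node_pool,
            "--num-nodes", str(num_nodes),
            "--zone", zone]


if __name__ == "__main__":
    # Placeholder names; substitute your own before running anything.
    commands = [
        scale_deployment_cmd("flux", "flux", 0),
        resize_node_pool_cmd("my-cluster", "us-central1-a", "default-pool", 0),
        # After the upgrade completes, restore the original size (3 is an example).
        resize_node_pool_cmd("my-cluster", "us-central1-a", "default-pool", 3),
    ]
    for cmd in commands:
        print(shlex.join(cmd))  # dry run: print only, do not execute
```

Record the original node count and replica counts before scaling down so that you can restore them afterwards.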

Alternative potential causes

Some of our engineers have come across scenarios where the control plane was inaccessible because of a race condition in the Linux netfilter/conntrack table. This has since been fixed, but older versions were susceptible to it.

Long-term solution

Google engineers are aware of the issue, and a fix is planned within the next month for version 1.16 or later (manual upgrade). At the time of writing, there is no public issue tracker link.

Upgrade your cluster from zonal to regional

Unfortunately, there is no “easy button” for converting a cluster from zonal to regional, but one of our cloud architects has written an article describing one approach to the migration using Velero, a popular open-source tool.
