Photo by Marta Sher from Shutterstock
In modern application management, Kubernetes is the foundation of container orchestration. It automates software deployment, scaling, and management, revolutionising delivery. However, growing complexity and scale pose challenges in troubleshooting and maintaining dynamic ecosystems.
Kubernetes troubleshooting is complex due to a variety of factors. The Kubernetes cluster architecture consists of multiple components that work together, including pods, services, configurations, and networking. These components often interact in unpredictable ways, making it difficult to pinpoint the root cause of an issue.
Kubernetes workloads also constantly evolve to meet changing demands, resulting in transient issues that are challenging to diagnose in real time. Traditional troubleshooting methods involve manual investigation, examining logs, metrics, and configurations across various components, making the process time-consuming and prone to errors.
This is where interactive playbooks in Google Kubernetes Engine (GKE) come into play. They provide structured, step-by-step troubleshooting guidance for common issues, which can help you resolve those issues faster and improve Mean Time to Resolution (MTTR).
The playbooks are available in the GCP Monitoring dashboards and are added automatically when the first workload is deployed to the cluster. The available interactive playbooks are listed below; check the GKE release notes for new playbook announcements.
Screenshot from GCP Monitoring -> Dashboards
The interactive playbooks use Cloud Monitoring and Cloud Logging data, so make sure log collection for workloads has not been disabled in a GKE Standard cluster (it is enabled by default in Autopilot clusters).
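If you are unsure whether workload logs are being collected, the sketch below shows one way to check from the command line. It assumes the gcloud CLI is installed and authenticated, and that the cluster reports its logging components under `loggingConfig.componentConfig.enableComponents`. The function names, the cluster name `my-cluster`, and the region `us-central1` are placeholders for illustration and do not come from this article.

```shell
# Hypothetical helper: succeeds when a list of logging components
# contains an exact WORKLOADS entry.
has_workload_logging() {
  # Split on ';' or ',' so either separator style is accepted.
  for component in $(printf '%s' "$1" | tr ';,' '  '); do
    if [ "$component" = "WORKLOADS" ]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical wrapper: reads the enabled logging components of a
# cluster (name and region passed as arguments) and reports the result.
check_workload_logging() {
  enabled=$(gcloud container clusters describe "$1" --region "$2" \
    --format="value(loggingConfig.componentConfig.enableComponents)")
  if has_workload_logging "$enabled"; then
    echo "workload logging is enabled"
  else
    echo "workload logging is disabled"
  fi
}
```

You would call it as `check_workload_logging my-cluster us-central1`. If it reports that workload logging is disabled, `gcloud container clusters update my-cluster --region us-central1 --logging=SYSTEM,WORKLOAD` should turn it back on, but check the gcloud reference for your version before running it.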
Let's see the interactive playbook in action
- Deploy a faulty workload to the cluster that fails to start due to config issues.
kubectl run sample-app --image simbu1290/gke-faulty-app:latest
- Check the workload status in the console.
- Click the CrashLoopBackOff status for the sample app, and a screen appears with additional details. Under the recommendations section, you can see the interactive playbook related to the error.
- Click View Interactive Playbook to open the playbook dashboard in GCP Monitoring. The dashboard shows more details about the error and the next steps for identifying the possible root cause.
Interactive playbook overview
Sample application error affecting the container startup process
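The same failure can also be inspected from a terminal, which is a useful cross-check against what the playbook shows. The sketch below wraps three standard kubectl commands in a helper; the function name `inspect_pod` is hypothetical, and it assumes the pod runs in your current namespace.

```shell
# Hypothetical helper: collect the basic facts about a failing pod.
inspect_pod() {
  pod="$1"
  # Current status and restart count (for example CrashLoopBackOff).
  kubectl get pod "$pod"
  # Container state, last exit code, and recent events for the pod.
  kubectl describe pod "$pod"
  # Logs from the previous container instance, i.e. the one that crashed.
  kubectl logs "$pod" --previous
}
```

For the workload deployed earlier, you would run `inspect_pod sample-app`.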
You can also quickly identify other reasons for failures with the help of the out-of-memory and liveness probe options available in the dashboard. The Correlate Change Events option provides a quick way to see if any recent deployment changes might have affected your workload. You can then compare the changes between the current and previous versions to identify the cause of the issue.
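For comparison, the out-of-memory and probe checks have rough command-line equivalents. The sketch below is not how the playbook itself works; it only reads the pod status and events with kubectl. All three function names are hypothetical, the pod is assumed to have a single container, and `Unhealthy` is the event reason the kubelet records for failed probes of any kind, not only liveness probes.

```shell
# Hypothetical helper: turn a container's last termination reason
# into a short hint.
describe_termination() {
  case "$1" in
    OOMKilled) echo "container exceeded its memory limit" ;;
    Error)     echo "container exited with a non-zero code; check its logs" ;;
    Completed) echo "container exited normally" ;;
    "")        echo "no previous termination recorded" ;;
    *)         echo "unrecognised reason: $1" ;;
  esac
}

# Read the last termination reason of the first container in a pod.
last_termination_reason() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
}

# List probe failure events (reason "Unhealthy") recorded for a pod.
probe_failures() {
  kubectl get events \
    --field-selector "involvedObject.name=$1,reason=Unhealthy"
}
```

A typical call would be `describe_termination "$(last_termination_reason sample-app)"`, followed by `probe_failures sample-app` if the termination reason does not explain the restarts.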
Sample Email alert policy setup
The alert notification includes the affected workload and a link to the interactive playbook dashboard, so you can begin troubleshooting quickly.
Sample Email alert notification
In conclusion, with the introduction of recommended interactive playbooks, the GKE team aims to simplify troubleshooting for common issues and ultimately maximise productivity. While interactive playbooks in GKE provide invaluable structured guidance for troubleshooting, there are scenarios where consulting a real person with specialised expertise becomes indispensable.
At doit.com, we have a global team of cloud architects who assist startups and major tech companies in solving intricate challenges on a daily basis.
Whether it's a unique configuration issue or a complex integration problem, our experts bring real-world experience to expedite the resolution process and enhance operational efficiency.