Running containers in Kubernetes has many advantages, but it can be difficult to monitor performance and availability of your constantly changing infrastructure. Kubernetes monitoring best practices are techniques you can use to make monitoring practical and manageable in a cloud native environment.
This is part of our series of articles on cloud monitoring.
Before you start monitoring, consider how to instrument your Kubernetes environment, to ensure it records the metrics you’ll need to monitor ongoing operations.
There are three main strategies to collect system events and metrics in Kubernetes:
The first two strategies are problematic for several reasons:
eBPF probes can overcome all these issues. eBPF programs run inside the Linux kernel, so a single probe on a node can observe the system calls made by every container on that node. In this way, you can gather all the information you need to troubleshoot, analyze root causes, and monitor performance in your Kubernetes environment.
Other processes running at the user level can combine the eBPF data with other data sources (for example, Prometheus, JMX, or system logs) and report it to the monitoring backend. eBPF instrumentation uses less RAM than embedded monitoring instrumentation, and has little impact on CPU usage or other processes.
Because Kubernetes resources change dynamically and deployed replicas are assumed to be symmetrical, monitoring the resources of individual containers can be very noisy.
Because metrics can change on an hourly basis, it is more important to look at patterns over long periods of time for groups of containers. For example, when a new ReplicaSet ID is created, the ReplicaSet metrics are reset, so a view tied to a single ReplicaSet loses its history after every rollout. You can use cAdvisor to collect CPU, memory, and network usage for the containers on each node, and then aggregate those metrics across groups of containers.
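To illustrate the idea of watching groups rather than individual containers, here is a minimal Python sketch that rolls per-pod CPU samples up to the workload level. It assumes the common Deployment naming pattern of workload name, ReplicaSet hash, and pod suffix; the function names and the sample data are hypothetical, and in a real cluster you would more likely group by labels in your monitoring backend.

```python
from collections import defaultdict
from statistics import mean


def workload_of(pod_name: str) -> str:
    """Return the workload name for a Deployment-style pod name.

    Assumes the pattern <workload>-<replicaset-hash>-<pod-suffix>.
    Names that do not have at least two dashes are returned unchanged.
    """
    parts = pod_name.rsplit("-", 2)
    return parts[0] if len(parts) == 3 else pod_name


def aggregate_by_workload(samples):
    """Group (pod_name, cpu_cores) samples by workload and summarize each group."""
    groups = defaultdict(list)
    for pod_name, cpu_cores in samples:
        groups[workload_of(pod_name)].append(cpu_cores)
    return {
        name: {"samples": len(values), "mean": mean(values), "max": max(values)}
        for name, values in groups.items()
    }


# Hypothetical samples; the third pod belongs to a new ReplicaSet after a rollout.
samples = [
    ("checkout-7d9f8b6c5d-abcde", 0.20),
    ("checkout-7d9f8b6c5d-fghij", 0.40),
    ("checkout-5b6c7d8e9f-klmno", 0.30),
    ("search-6c5d4e3f2a-pqrst", 1.00),
]
summary = aggregate_by_workload(samples)
```

Because the grouping key ignores the ReplicaSet hash, the `checkout` series continues across the rollout instead of starting over with the new ReplicaSet.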
Detailed resource metrics (CPU, load, memory, etc.) are important to track, but they are not closely correlated with problems that directly impact users. Better KPIs are API indicators such as call errors, request rates, and timeouts, which can help you quickly determine if there is a user-facing or application-facing problem with your microservices.
The easiest way to get service-level metrics is to collect them at a service load balancer, preferably an ingress controller such as NGINX or a service mesh gateway such as Istio. This lets you detect anomalies in REST API requests automatically and in a standardized way across all Kubernetes services. You can raise alerts at any level of the API lifecycle, and also use alerts to automate infrastructure changes.
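As a rough illustration of these API indicators, the following Python sketch computes request rate, error ratio, and timeout ratio per service from a batch of request records, such as ones parsed from ingress access logs. The `Request` record, the 1000 ms timeout cutoff, and the 5% alert threshold are assumptions made for the example, not values prescribed by any ingress controller.

```python
from dataclasses import dataclass


@dataclass
class Request:
    service: str       # name of the backend service that handled the call
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request


def service_indicators(requests, window_seconds, timeout_ms=1000.0):
    """Compute request rate, error ratio, and timeout ratio per service."""
    stats = {}
    for r in requests:
        s = stats.setdefault(r.service, {"total": 0, "errors": 0, "timeouts": 0})
        s["total"] += 1
        if r.status >= 500:            # count server-side failures as errors
            s["errors"] += 1
        if r.latency_ms >= timeout_ms:  # treat very slow calls as timeouts
            s["timeouts"] += 1
    return {
        svc: {
            "request_rate": s["total"] / window_seconds,
            "error_ratio": s["errors"] / s["total"],
            "timeout_ratio": s["timeouts"] / s["total"],
        }
        for svc, s in stats.items()
    }


def services_to_alert(indicators, max_error_ratio=0.05):
    """Return the services whose error ratio exceeds the threshold, sorted by name."""
    return sorted(
        svc for svc, i in indicators.items() if i["error_ratio"] > max_error_ratio
    )


# Hypothetical requests observed during a 60-second window.
observed = [
    Request("cart", 200, 35.0),
    Request("cart", 503, 1200.0),
    Request("cart", 200, 40.0),
    Request("cart", 200, 38.0),
    Request("catalog", 200, 20.0),
    Request("catalog", 404, 15.0),
]
indicators = service_indicators(observed, window_seconds=60)
alerts = services_to_alert(indicators)
```

Client errors such as 404 are deliberately not counted here, on the assumption that they reflect caller mistakes rather than service health; adjust the rule if your services use status codes differently.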
You should monitor the kube-system namespace as closely as possible. Problems inside the Kubernetes cluster itself are typically the most difficult to solve, and should be detected early. These can include DNS bottlenecks, network congestion, or worst of all, etcd problems. In particular, it is important to monitor the performance of control plane (master) nodes, including CPU usage, memory, and disk space.
It is important to track errors, crashes, and performance issues at every layer of the Kubernetes environment. Complex issues may require debugging at multiple levels, and engineers need to have access to metrics for every component. For example, when debugging an issue, a developer may need to:
High disk utilization is a common problem on any system, and Kubernetes nodes are no exception. If you are using StatefulSet resources or volumes that are statically attached to nodes, there is no quick fix.
Disk utilization warnings are almost always severe and usually indicate a problem with the application. Make sure to keep track of all disk volumes, including the root file system, and set alerts for around 80% utilization. Over time, try to see if there are patterns to high disk usage and make changes to your deployment to address the root cause.
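A threshold check like the one described above can be sketched in a few lines of Python using the standard library's `shutil.disk_usage`. The function name, the injectable `usage_fn` parameter, and the fake readings are hypothetical and exist only to make the example easy to try without a real node.

```python
import shutil
from collections import namedtuple


def volumes_over_threshold(mount_points, threshold=0.80, usage_fn=shutil.disk_usage):
    """Return (path, utilization) pairs for volumes at or above the threshold.

    usage_fn must return an object with `total` and `used` attributes in bytes,
    as shutil.disk_usage does.
    """
    over = []
    for path in mount_points:
        usage = usage_fn(path)
        ratio = usage.used / usage.total
        if ratio >= threshold:
            over.append((path, ratio))
    return over


# Fake readings standing in for real volumes (values in bytes, for illustration only).
FakeUsage = namedtuple("FakeUsage", "total used free")
fake_readings = {
    "/": FakeUsage(total=100, used=50, free=50),
    "/var/lib/data": FakeUsage(total=100, used=85, free=15),
}

flagged = volumes_over_threshold(fake_readings, usage_fn=fake_readings.__getitem__)
```

On a real node you would call `volumes_over_threshold(["/", "/var/lib/data"])` with the default `usage_fn`, listing every mount point you care about, including the root file system.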
NetApp Cloud Insights is an infrastructure monitoring tool that gives you visibility into your complete infrastructure. With Cloud Insights, you can monitor, troubleshoot and optimize all your resources including your public clouds and your private data centers.
Cloud Insights helps you find problems fast before they impact your business. Optimize usage so you can defer spend, do more with your limited budgets, detect ransomware attacks before it’s too late and easily report on data access for security compliance auditing.
In particular, NetApp Cloud Insights helps you gain an understanding of your Kubernetes architecture through topology visualization, including relationships between persistent volume claims and the storage infrastructure they’re using, and monitor the health of Kubernetes clusters.
Start a 30-day free trial of NetApp Cloud Insights. No credit card required.