Monitoring Kubernetes with Prometheus

Kubernetes and Prometheus were the first and second open-source projects brought into the then newly minted Cloud Native Computing Foundation (CNCF). Both systems were designed as cloud-native tools from the start. While Kubernetes was designed to run and manage microservices-based applications at scale, Prometheus was designed specifically to monitor and alert on such systems.

The Kubernetes monitoring challenge

While Kubernetes is an incredibly powerful and performant tool for running container-based applications reliably, it is also a complex system with multiple components. A Kubernetes cluster involves multiple servers that can span private and public cloud services, and Kubernetes is frequently deployed with additional services that provide performance enhancements. Unlike troubleshooting a single application on a single server, troubleshooting Kubernetes usually means examining multiple logs and services.

What does it take to monitor a Kubernetes cluster?

Before we consider why Prometheus meets the monitoring requirements of Kubernetes, let’s consider what needs to be monitored. The master (control plane) node is the command center for the cluster and runs the configuration database (etcd), the API server, and the scheduler. Each worker node runs a node agent (the kubelet) that communicates with the master node via the API server, a proxy service (kube-proxy), and a container runtime. There are also many Kubernetes add-ons that extend its functionality, with networking and network policy being by far the most popular category. Below, you can see the Kubernetes-specific components involved in a typical cluster, which highlights the range of sources a monitoring tool needs to be able to collect data from.

Why use Prometheus to monitor Kubernetes

Let’s first consider some of the features you get when you install Prometheus.

  • a multi-dimensional data model in which time series are identified by a metric name and key/value pairs
  • a pull model that collects time series over HTTP
  • support for pushing time series as well
  • monitoring targets that can be configured statically or discovered automatically (service discovery)
  • PromQL, a powerful and flexible query language for exploring your monitoring data

These core functions should convince you that Prometheus is a powerful and high-performing monitoring tool.
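
To make a couple of these points concrete (the pull model and statically configured targets), here is a minimal sketch of a Prometheus scrape configuration. The job name, target address, and port are placeholders rather than anything from this article; Prometheus simply pulls the /metrics endpoint of each listed target over HTTP on every scrape interval.

    # prometheus.yml -- minimal sketch with a hypothetical static target
    global:
      scrape_interval: 15s          # how often Prometheus pulls metrics from each target
    scrape_configs:
      - job_name: "example-app"     # placeholder job name
        static_configs:
          - targets: ["example-app.default.svc:8080"]   # metrics served at /metrics over HTTP

Static lists like this are fine for fixed infrastructure; the next section looks at how service discovery replaces them in a dynamic cluster.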

Service discovery and pull-based monitoring

One of the strengths of Kubernetes, and one of the challenges for any monitoring system, is that Kubernetes can automatically spin up new Pods to meet demand or to replace failed Pods. At any given moment, it is difficult (and also unnecessary) to know exactly where the Pods that make up a Service are running. Prometheus provides service discovery that automatically finds new Pods and starts pulling metrics from them. This pull-based model of metric collection, combined with service discovery, is well suited to the demands of dynamic cloud environments.
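
As a sketch of what this looks like in practice, the scrape job below uses Prometheus’ built-in Kubernetes service discovery to find Pods and keeps only those that opt in through a prometheus.io/scrape annotation. The annotation convention is a widely used pattern rather than anything mandated by Kubernetes or Prometheus, and the job assumes Prometheus runs in-cluster with a service account allowed to list Pods.

    scrape_configs:
      - job_name: "kubernetes-pods"
        kubernetes_sd_configs:
          - role: pod               # discover every Pod in the cluster via the Kubernetes API
        relabel_configs:
          # Keep only Pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"
          # If a prometheus.io/port annotation is present, scrape that port
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: '([^:]+)(?::\d+)?;(\d+)'
            replacement: '$1:$2'
            target_label: __address__
          # Copy the Pod's namespace and name onto every scraped time series
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod

As Pods come and go, the target list updates itself; no Prometheus restart or manual reconfiguration is needed.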

Labels

The use of labels, which are simply key/value pairs, is a concept shared by both Kubernetes and Prometheus. In Kubernetes, labels can designate services, releases, customers, environments (prod vs. dev), and much more. While Prometheus can attach and rewrite labels of its own (for example, through relabelling), PromQL can also work directly with the labels defined in your Kubernetes environment. Labels can then be used to select the time series of interest and to aggregate metrics across matching series.
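
As an example of label-based selection and aggregation, the recording rule below is a sketch only: the rule and group names are made up, the prod namespace is an assumption, and it presumes the cAdvisor metrics exposed by the kubelet are being scraped. It selects container CPU usage by the namespace label and then sums it per Pod.

    groups:
      - name: per-pod-cpu                  # hypothetical rule group
        rules:
          # CPU usage restricted to the "prod" namespace by label selection,
          # then aggregated across containers using the namespace and pod labels.
          - record: namespace_pod:container_cpu_usage_seconds:rate5m
            expr: 'sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{namespace="prod"}[5m]))'

The same expression can be run interactively in the Prometheus UI; the rule form simply precomputes the result and stores it under a new metric name.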

Exporters and Kubernetes Pods

Kubernetes components expose Prometheus metrics natively, but there are times when you need to monitor a system that is not natively integrated with Prometheus (e.g., Postgres). In this case, you can co-deploy an exporter in a Pod that runs alongside your service. The role of the exporter is to translate the service’s own metrics into a format that Prometheus can consume.
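
A minimal sketch of the pattern: a Pod that runs Postgres alongside the community postgres_exporter as a sidecar. The names, image tags, and connection string are illustrative assumptions (a real deployment would take credentials from a Secret and typically use a Deployment rather than a bare Pod); the exporter listens on its default port 9187 and republishes Postgres statistics as Prometheus metrics.

    apiVersion: v1
    kind: Pod
    metadata:
      name: postgres-with-exporter          # hypothetical name
      annotations:
        prometheus.io/scrape: "true"        # opts in to the discovery job sketched earlier
        prometheus.io/port: "9187"          # tells Prometheus which port serves metrics
    spec:
      containers:
        - name: postgres
          image: postgres:16                # assumed image tag
          ports:
            - containerPort: 5432
        - name: postgres-exporter           # sidecar that translates Postgres stats for Prometheus
          image: quay.io/prometheuscommunity/postgres-exporter   # assumed image location
          env:
            - name: DATA_SOURCE_NAME        # connection string; use a Secret in real deployments
              value: "postgresql://postgres@localhost:5432/postgres?sslmode=disable"
          ports:
            - containerPort: 9187           # default postgres_exporter metrics port

Because the two containers share the Pod’s network namespace, the exporter reaches Postgres on localhost, and Prometheus discovers and scrapes the sidecar like any other Pod.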

Deciding which metrics to monitor

You could decide to instrument everything, and Prometheus could handle it; however, metrics storage can become a limiting factor. The Kubernetes community generally agrees that there are four principal types of metrics to monitor: running Pods and their Deployments, node resources (such as disk I/O), container-native metrics, and application metrics. Frameworks such as USE and RED can help you decide which additional metrics to include.
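
As an illustrative sketch only, here is one representative PromQL expression per category, written as a recording-rule file. It assumes kube-state-metrics and node_exporter are installed (a common but not automatic addition to a cluster), and http_requests_total is a placeholder for whatever metrics your own application exposes.

    groups:
      - name: metrics-by-category            # hypothetical rule group
        rules:
          # Pods/Deployments: available vs. desired replicas (from kube-state-metrics)
          - record: deployment:replicas_available:ratio
            expr: kube_deployment_status_replicas_available / kube_deployment_spec_replicas
          # Node resources: fraction of filesystem space still available (from node_exporter)
          - record: node:filesystem_avail:ratio
            expr: node_filesystem_avail_bytes / node_filesystem_size_bytes
          # Container-native metrics: per-container working-set memory (from cAdvisor)
          - record: namespace_pod_container:memory_working_set_bytes
            expr: 'sum by (namespace, pod, container) (container_memory_working_set_bytes)'
          # Application metrics: request rate from your own instrumentation
          - record: app:http_requests:rate5m
            expr: 'sum by (namespace, pod) (rate(http_requests_total[5m]))'

Whichever categories you settle on, precomputed aggregates like these keep dashboards fast without querying raw, high-cardinality series at view time.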

Conclusion

From an operator’s or SRE’s perspective, a monitoring tool needs to collect metrics from a complex and changing system without itself being difficult to manage. Prometheus addresses both challenges. Because of the deep, native integration between Kubernetes and Prometheus, it is remarkably easy to get up and running. It is easy to get metrics on high-level constructs such as Services and node resources, and just as easy to zoom in on Pod, container, and application metrics. Together, Kubernetes and Prometheus give you the data needed to ensure that overall system function is acceptable, e.g., by tracking SLOs. The combination also lets operators identify resource bottlenecks and use that information to improve overall application performance. For many teams looking for a Kubernetes monitoring solution, Prometheus is an easy first step, and thanks to its simplicity and power, it is often the last one they need.