Observability on K8s — Monitor Kubernetes Clusters with DataDog

Anh Tu (James) Nguyen · Published in Geek Culture · 8 min read · Oct 17, 2020

Screenshot taken from the DataDog integrations console

When you are managing more than 10 Kubernetes (EKS) clusters for your team, with critical applications running on them, you need to set up an observability stack covering monitoring, logging, and tracing.

In this blog, I will share how we built a monitoring system, set up dashboards, and monitor our K8s infrastructure using DataDog.

Why DataDog?

There are many observability tools and platforms, both open-source and commercial, but I chose DataDog for this blog because it has wide support for multiple platforms (AWS, Azure, GCP, on-premise, etc.) and applications, as well as compatibility with Kubernetes and Prometheus metrics. Also, the main thing is that I'm a (Data)Dog fan.

You can check out the list of DataDog built-in integrations or refer to their integrations-core and write your own scripts.

Requirements

In this blog, I will use the DataDog Helm chart from the stable repository, version 1.39.9. Since version 2.x.x, the DataDog Helm chart has been refactored and moved to its official repo, but the concepts remain the same.

Deploying

I have generated sample values.yaml for DataDog deployment as follows:
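Here is a minimal sketch of what that values file might look like. The key paths follow the stable/datadog 1.x chart layout and should be verified against your chart version's own values.yaml:

datadog:
  clusterName: dev-eks-datadog-001
  tags:
    - "cloud:aws"
    - "distribution:eks"
  leaderElection: true
  collectEvents: true
clusterAgent:
  enabled: true
  metricsProvider:
    enabled: true  # exposes the External Metrics Provider used for HPA later
kubeStateMetrics:
  enabled: true
rbac:
  create: true  # referenced as agents.rbac.create in the newer 2.x charts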

I'm using an EKS cluster for this tutorial, so you can see the tags defined as cloud:aws and distribution:eks, and the cluster named following the convention <env>-<platform>-<category>-xxx, which gives dev-eks-datadog-001.

Labels or tags are very important when you design and set up a monitoring system for your core infrastructure: they help you organize, group, and filter your data to troubleshoot and understand your environment. I will share more examples later in the Dashboards and Monitors sections.

Deploy DataDog using helm:

➜  ~ helm install --name datadog --namespace monitoring -f datadog-k8s-values.yaml --set datadog.apiKey=<API_KEY> --set datadog.appKey=<APPLICATION_KEY> --version=1.39.9 stable/datadog
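Note that the command above uses Helm 2 syntax. On Helm 3, the release name is positional and the --name flag no longer exists:

➜  ~ helm install datadog --namespace monitoring -f datadog-k8s-values.yaml --set datadog.apiKey=<API_KEY> --set datadog.appKey=<APPLICATION_KEY> --version=1.39.9 stable/datadog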

When the helm command finishes, you should see something similar to this:

➜  ~ kubectl get pods -n monitoring --no-headers | grep datadog
datadog-2tspw 1/1 Running 0 2m
datadog-6bd328d579-tj8lp 1/1 Running 0 2m33s
datadog-cluster-agent-54hv37f8fb-cgp2f 1/1 Running 0 2m23s
datadog-kube-state-metrics-56gre5ft89-lx8e9 1/1 Running 0 1m56s
datadog-pvltl 1/1 Running 0 2m55s
datadog-qtlh9 1/1 Running 0 3m12s
➜  ~ kubectl get svc -n monitoring --no-headers | grep datadog
datadog ClusterIP 172.20.6.24 <none> 8125/UDP 2m
datadog-cluster-agent ClusterIP 172.20.10.15 <none> 5005/TCP 2m
datadog-cluster-agent-metrics-api ClusterIP 172.20.12.27 <none> 443/TCP 2m
datadog-kube-state-metrics ClusterIP 172.20.24.17 <none> 8080/TCP 2m
➜  ~ kubectl get deployments -n monitoring --no-headers | grep datadog
datadog 1/1 1 1 3m
datadog-cluster-agent 1/1 1 1 3m
datadog-kube-state-metrics 1/1 1 1 3m
➜  ~ kubectl get daemonsets -n monitoring --no-headers | grep datadog
datadog 3 3 3 3 3 <none> 3m

You can see DataDog has been deployed with 3 main components:

  • DataDog agent daemonSet.
  • DataDog cluster-agent.
  • DataDog Kube-state-metrics (KSM): a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects: node status, node capacity (CPU and memory), number of desired/available/unavailable/updated replicas per deployment, pod status (e.g., waiting, running, ready), and so on.
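If you want to peek at the raw KSM metrics yourself, you can port-forward the KSM service (it listens on port 8080, as in the service listing above) and curl it. The metric names here are standard kube-state-metrics ones:

➜  ~ kubectl port-forward -n monitoring svc/datadog-kube-state-metrics 8080:8080 &
➜  ~ curl -s localhost:8080/metrics | grep -E 'kube_pod_status_phase|kube_node_status_capacity' | head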

To understand the DataDog agents and cluster-agent, let me explain a bit about the old and current DataDog designs, which can be found in their official docs.

Diagram without the Cluster Agent (source: DataDog)

As you can see in the diagram above, without the Cluster Agent, every worker node in the cluster runs an agent pod that collects data from two main sources:

  • kubelet:

By monitoring the kubelet on each worker node, the Datadog Agent gives you insights into how your containers are behaving and helps you keep track of scheduling-related issues. The Agent also retrieves system-level data and automatically discovers and monitors applications running on the node.

  • The cluster’s control plane, which consists of the API server, the scheduler, the controller manager, etc.

In addition to collecting these node-level metrics, each Datadog Agent individually queries the API server on the master node to collect data about the behavior of specific Kubernetes components, as well as to gather key metadata about the cluster as a whole.

Each Agent also retrieves the list of services that target the pods scheduled on that particular node, uses this data to map relevant application metrics to services, and then tags each metric with the appropriate pod name and service. Agents can also be configured to elect a leader that queries the API server regularly to collect Kubernetes events.

While the old approach gave you visibility into all the layers of the cluster, it put an increasing load on the API server and etcd as the cluster size grew.

That's why DataDog developed and introduced the Cluster Agent, as you can see in the diagram below:

Diagram with the Cluster Agent (source: DataDog)

DataDog Cluster Agent main features:

  • Provides a streamlined, centralized approach to collecting cluster-level monitoring data.
  • By acting as a proxy between the API server and node-based Agents, the Cluster Agent helps to alleviate server load.
  • Relays cluster-level metadata to node-based Agents, allowing them to enrich the metadata of locally collected metrics.

Using the Datadog Cluster Agent allows you to:

  • Alleviate the impact of Agents on your infrastructure.
  • Isolate node-based Agents to their respective nodes, reducing RBAC rules to solely read metrics and metadata from the kubelet.
  • Provide cluster-level metadata that can only be found in the API server to the Node Agents, for them to enrich the metadata of the locally collected metrics.
  • Enable the collection of cluster-level data, such as the monitoring of services, single points of failure (SPOF), and events.
  • Leverage Horizontal Pod Autoscaling (HPA) with custom Kubernetes metrics.

With this setup, the DataDog agent no longer needs to elect a leader pod; the cluster agent handles K8s event collection instead.

The DataDog cluster-agent also provides an External Metrics Provider, so you can define HPAs based on DataDog metrics (not limited to CPU/memory utilization).
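As a sketch of what that looks like, here is an HPA manifest adapted from DataDog's documented example. The nginx deployment and the nginx.net.request_per_s metric are illustrative assumptions; any metric visible in DataDog can be used:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: nginxext
spec:
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  metrics:
  - type: External
    external:
      metricName: nginx.net.request_per_s  # any DataDog metric can back the HPA
      metricSelector:
        matchLabels:
          kube_container_name: nginx
      targetAverageValue: 9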

DataDog Events

In this setup, I have enabled K8s event collection by setting the datadog.leaderElection, datadog.collectEvents, and agents.rbac.create options to true in the datadog-k8s-values.yaml file.

This feature tells the DataDog Cluster Agent to collect the K8s events from the control plane and forward them to DataDog. You can easily check what is happening in your clusters, find out why a pod can't be scheduled, or troubleshoot a BackOff container without needing kubectl get events or kubectl describe nodes|pods.

K8s events — Backoff

We can easily create dashboards or monitors to watch for any K8s events and send notifications in near real-time.

Dashboards

In this section, I will share my own experience building a common dashboard for all running Kubernetes (EKS) clusters.

I typically break the metrics down into different layers and group them as follows:

  • Cloud/Cluster infrastructure metrics: EC2, RDS, ElastiCache, etc.
  • System metrics: OS, CPU, Memory, Disk IOps, Network bandwidth, etc.
  • Orchestration (K8s) metrics: Node status, deployments/replicaSet/statefulSet, pods (CPU/Mem), PVC size, etc.
  • Application metrics: HTTP response time, Kafka consumer lag, JVM Heap, Logstash events, Spark job metrics, etc.

Here is an example dashboard I created that provides an overview of all running EKS clusters in my team:

Cluster-metrics graphs — 1
Cluster-metrics graphs — 2

DataDog supports configuring template variables to quickly filter or query metrics by specific tags or attributes, e.g., subaccount (AWS accounts), k8s_cluster (K8s cluster names), k8s_node (EC2 worker nodes), k8s_namespace, k8s_deployment, k8s_daemonset, and so on.

DataDog template variables

You may have noticed that I'm using two different tags to filter the same thing, like k8s_namespace and k8s_state_namespace. That is because there are two sets of metrics gathered by DataDog, and they are tagged differently:

  • kubernetes.* metrics are collected by DataDog agents/cluster-agent.
  • kubernetes_state.* metrics are gathered from the Kube-state-metrics (KSM) service.
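For example, the same namespace filter ends up under a different template variable depending on the metric family (the metric names below are standard DataDog ones; the tag names follow my setup's conventions):

avg:kubernetes.cpu.usage.total{$k8s_namespace} by {pod_name}
sum:kubernetes_state.pod.ready{$k8s_state_namespace} by {namespace}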

Besides cluster graphs, I also added system metrics and K8s metrics into the dashboard like this:

System-metrics graphs — 1
K8s-metrics graphs — 1
K8s-metrics graphs — 2

You can add more metrics to your dashboard for more visibility into your infrastructure, but I would recommend splitting the different layers (Cluster/Orchestrator/System/Application/etc.) into separate dashboards, because the dashboard UI becomes laggy when you query a huge metrics subset or a long time range (1 week or 1 month).

Monitors

Before creating a monitor, I think through some basic information:

  • What are the things to look out for when you pick out the metrics? (High CPU/Mem, Events drop, error rates, HTTP slow response, p95/p99, etc.)
  • Calculate metrics and set a threshold (min/max, avg/sum, week/day before, etc.).
  • Metrics tagging/labeling.
  • Runbook for troubleshooting steps and link to logs.
  • Alerts notification (Slack, PagerDuty).

As I mentioned above, labels or tags are important, especially when we design and set up common alerts for the core infra. Let me share one example, creating an alert for failed pods in a namespace; a consolidated sketch of the query and message follows the list below.

  • Metrics to check: kubernetes_state.pod.status_phase with phase:failed
Failed pods alert — 1
  • sum by: kubernetes_cluster, cluster_env, and namespace. The idea is that when a namespace in a cluster has too many failed pods, on-call engineers need to quickly investigate the root cause.
  • Set Alert and Warning thresholds for the monitor:
Failed pods alert — 2
  • Metrics tagging/labeling: since I defined a list of tags in sum by above, DataDog allows us to use those variables in the alert message:
Failed pods alert — 3
  • Then you can see the alert message as shown below:
Failed pods alert — 4
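Putting it together, the monitor query and alert message would look roughly like this (the threshold of 10 and the runbook link are illustrative; {{#is_match}} is DataDog's standard conditional message templating):

sum(last_5m):sum:kubernetes_state.pod.status_phase{phase:failed} by {kubernetes_cluster,cluster_env,namespace} > 10

{{#is_alert}}
Too many failed pods in namespace {{namespace.name}} on {{kubernetes_cluster.name}} ({{cluster_env.name}}).
Runbook: <link-to-your-runbook>
{{/is_alert}}
@slack-k8s-{{cluster_env.name}}-alerts
{{#is_match "cluster_env.name" "prd"}}@pagerduty-de-infra{{/is_match}}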

@slack-k8s-{{cluster_env.name}}-alerts: we can route alerts from different environments (dev, stage, or prd) to different Slack channels (k8s-dev-alerts, k8s-stg-alerts, or k8s-prd-alerts) and only trigger PagerDuty for the on-call (@pagerduty-de-infra) if the failure happens in the production environment.

An example alert in Slack looks like this:

Failed pods alert — 5

What’s Next

This blog is based on my own experience working with DataDog on an existing EKS setup; it may or may not apply to your environment. You can visit DataDog's official documentation for more information. Feel free to leave comments or questions.

I hope you enjoyed reading this blog. In the next one, I will share about DataDog Autodiscovery and DogStatsD.
