Key monitoring concepts you should be aware of

Discover the key monitoring concepts you need to know when working with monitoring and time series visualization tools like Prometheus, Victoria Metrics, or Grafana.

Photo by Luke Chesser / Unsplash

Metric

A numeric measure or observation of something. Here are example metrics about requests to a web application. A metric's name should make clear what is actually measured:

  • requests_total
  • requests_success_total
  • request_errors_total

Time series

A combination of a metric and its labels.

requests_total{path="/", code="200"}

path="/" and code="200" are two labels associated with the requests_total metric.

Time series labels are key/value pairs. Two time series with the same label keys but different label values are different time series. Here is an example:

# Two different time series
requests_total{path="/", code="200"}
requests_total{path="/contact", code="200"}

The requests_total{path="/", code="200"} time series could also be written like this:

# __name__ is a special label that can be used to indicate the metric name
{__name__="requests_total", path="/", code="200"}

# __name__ can also be omitted
{"requests_total", path="/", code="200"}
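To make the idea of series identity concrete, here is a minimal Python sketch. It assumes a time series can be modelled as the set of its label pairs, with the metric name stored under __name__. The series_key helper is hypothetical and is not part of any monitoring tool.

```python
def series_key(metric_name, labels):
    """Build a hashable identity for a time series from its name and labels."""
    # Store the metric name under the special __name__ label, next to the others.
    all_labels = dict(labels)
    all_labels["__name__"] = metric_name
    return frozenset(all_labels.items())


home = series_key("requests_total", {"path": "/", "code": "200"})
contact = series_key("requests_total", {"path": "/contact", "code": "200"})
# Same labels written in a different order: still the same time series.
home_again = series_key("requests_total", {"code": "200", "path": "/"})

print(home == contact)     # False: the path label value differs
print(home == home_again)  # True: only the set of key/value pairs matters
```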

Cardinality

For a monitoring system, cardinality is the total number of unique time series it holds. For a single metric, cardinality is the number of unique time series produced for that metric.

High cardinality may increase memory usage.
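As a rough illustration, both views of cardinality can be computed by counting unique label sets. The cardinality_per_metric helper and the observed series below are made up for the example.

```python
from collections import defaultdict


def cardinality_per_metric(series):
    """Count the unique label sets seen for each metric name."""
    unique = defaultdict(set)
    for name, labels in series:
        unique[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in unique.items()}


# Hypothetical series, written as (metric name, labels) pairs.
observed = [
    ("requests_total", {"path": "/", "code": "200"}),
    ("requests_total", {"path": "/contact", "code": "200"}),
    ("requests_total", {"path": "/", "code": "200"}),  # same series seen again
    ("request_errors_total", {"path": "/", "code": "500"}),
]

per_metric = cardinality_per_metric(observed)
total = sum(per_metric.values())  # cardinality of the whole (toy) system

print(per_metric)  # {'requests_total': 2, 'request_errors_total': 1}
print(total)       # 3
```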

Churn rate

The rate at which old time series are replaced by new ones. A high churn rate is mainly caused by labels whose values change frequently (timestamp, queryid, hash, etc.).

A high churn rate increases the total number of time series in the monitoring system's database and may slow down queries that span multiple days.
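The effect can be sketched by comparing the series seen in two consecutive scrapes. The series names and queryid values are invented, and new_series is a hypothetical helper.

```python
def new_series(previous, current):
    """Return the series present in the current scrape but not in the previous one."""
    return current - previous


# Hypothetical scrapes where the queryid label changes on every request.
before = {'requests_total{queryid="a1"}', 'requests_total{queryid="b2"}'}
after = {'requests_total{queryid="c3"}', 'requests_total{queryid="d4"}'}

churned = new_series(before, after)
total_stored = len(before | after)

print(len(after))    # 2: the number of active series did not grow
print(len(churned))  # 2: but every active series is a new one
print(total_stored)  # 4: and the database now holds all four
```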

Raw sample (or data point)

The (value, timestamp) pair associated with a time series. A raw sample is also called a data point.

# Raw sample in Prometheus text exposition format
requests_total{path="/", code="200"} 123 4567890

The raw sample (or data point) associated with the requests_total{path="/", code="200"} time series consists of 123 (the sample value) and 4567890 (the sample timestamp).

In Pull model monitoring systems, the sample's timestamp is added by the program that collects (scrapes) the metric.

In Push model monitoring systems, the timestamp is added directly by the application or client sending the metric.
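The sample line shown earlier can be split into its three parts with a few lines of Python. This is a simplified sketch, not a full parser for the exposition format: the parse_sample helper is hypothetical and assumes the line always ends with a value and a timestamp.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    series: str      # metric name and labels, kept as written
    value: float     # sample value
    timestamp: int   # sample timestamp, kept as written in the line


def parse_sample(line):
    """Split 'series value timestamp' from the right, so spaces inside the labels survive."""
    series, value, timestamp = line.rsplit(None, 2)
    return Sample(series, float(value), int(timestamp))


sample = parse_sample('requests_total{path="/", code="200"} 123 4567890')
print(sample.value)      # 123.0
print(sample.timestamp)  # 4567890
```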

Time series resolution (or step)

The minimum interval between raw samples (or data points) of a time series. A time series whose value is updated every 30 seconds has a resolution of 30 seconds.

In Pull model monitoring systems, resolution is controlled by the clients collecting (scraping) the metrics and corresponds to the scrape interval (the time separating two scrapes).

In Push model monitoring systems, resolution is the interval between the timestamps of consecutive raw samples received by the monitoring system.
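Following that definition, the resolution of a series can be estimated from its sample timestamps. The resolution helper below is a hypothetical sketch that assumes timestamps in seconds and at least two samples.

```python
def resolution(timestamps):
    """Return the smallest gap between consecutive sample timestamps."""
    ordered = sorted(timestamps)
    # min() raises ValueError if fewer than two timestamps are given.
    return min(later - earlier for earlier, later in zip(ordered, ordered[1:]))


print(resolution([0, 30, 60, 90]))  # 30
```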

Instant query and range query

Deduplication

Ensures that only the last raw sample of each time series is kept for each discrete interval of X time units. If multiple scrapers collect the same targets and each sends metrics to the monitoring system every 15s, configuring deduplication with X=15s helps clean up the duplicated data and avoid wasting storage space.
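The idea can be sketched as bucketing samples by interval and keeping the last one in each bucket. This is a simplified model with made-up data and a hypothetical deduplicate helper. Real monitoring systems may align intervals differently.

```python
def deduplicate(samples, interval):
    """Keep only the last (timestamp, value) sample in each interval-sized bucket."""
    last_per_bucket = {}
    for timestamp, value in sorted(samples):
        # Later samples overwrite earlier ones that fall in the same bucket.
        last_per_bucket[timestamp // interval] = (timestamp, value)
    return [last_per_bucket[bucket] for bucket in sorted(last_per_bucket)]


# Two hypothetical scrapers hitting the same target every 15s, 3s apart.
received = [(0, 10), (3, 10), (15, 12), (18, 12), (30, 15), (33, 15)]

print(deduplicate(received, 15))  # [(3, 10), (18, 12), (33, 15)]
```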

Downsampling

For each interval (5 minutes, for instance), keeps only the last sample among samples older than X days. Some monitoring tools, like Victoria Metrics, also support configuring downsampling separately for different sets of time series. Have a look at Victoria Metrics downsampling for more.
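A minimal sketch of that rule, assuming timestamps and durations are expressed in the same unit. The downsample helper and its parameters are hypothetical and do not mirror any tool's configuration.

```python
def downsample(samples, now, older_than, interval):
    """Thin out samples older than `older_than`, keeping the last one per interval."""
    cutoff = now - older_than
    recent = sorted(s for s in samples if s[0] >= cutoff)
    last_per_bucket = {}
    for timestamp, value in sorted(samples):
        if timestamp < cutoff:
            last_per_bucket[timestamp // interval] = (timestamp, value)
    old = [last_per_bucket[bucket] for bucket in sorted(last_per_bucket)]
    return old + recent


# Made-up samples every 100 time units; everything before 600 is "old".
stored = [(0, 1), (100, 2), (200, 3), (300, 4), (400, 5), (500, 6), (600, 7), (700, 8)]

print(downsample(stored, now=1000, older_than=400, interval=300))
# [(200, 3), (500, 6), (600, 7), (700, 8)]
```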

Relabeling

Consists of modifying time series labels before they are stored. Have a look at Prometheus-compatible relabeling for relabeling examples that work with Prometheus and Victoria Metrics.
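Conceptually, relabeling is a transformation applied to a label set before it is stored. The sketch below covers only dropping and renaming labels. The relabel function and its drop and rename parameters are hypothetical and are not the Prometheus relabeling syntax.

```python
def relabel(labels, drop=(), rename=None):
    """Return a copy of `labels` without the dropped keys and with keys renamed."""
    rename = rename or {}
    result = {}
    for key, value in labels.items():
        if key in drop:
            continue  # this label is removed before storage
        result[rename.get(key, key)] = value
    return result


incoming = {"path": "/", "code": "200", "queryid": "a1"}
stored_labels = relabel(incoming, drop=("queryid",), rename={"code": "status"})

print(stored_labels)  # {'path': '/', 'status': '200'}
```

Dropping a frequently changing label such as queryid is one way relabeling can be used to limit cardinality and churn.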

Types of metrics

Counter

  • Counts events (number of requests, logs, etc.)
  • Increases or stays the same over time
  • Decreases only when the metric is reset to zero (for example, when the exposing service restarts)

Well named counter metrics will generally have the following suffixes:

  • _total
  • _sum
  • _count

The query language functions most commonly used with counters are rate and increase.
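Because a counter only decreases when it is reset, a drop between two samples can be read as a restart from zero. The following sketch applies that assumption when computing a total increase. The counter_increase helper is hypothetical, and real query engines handle more edge cases.

```python
def counter_increase(values):
    """Sum the increases between consecutive counter values, tolerating resets."""
    total = 0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter went down: assume it restarted from zero.
            total += current
    return total


# Made-up counter values with a reset between 15 and 3.
print(counter_increase([10, 15, 15, 3, 8]))  # 13
```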

Gauge

What is a gauge metric

Histogram

What is a histogram metric

Summary

What is a summary metric

Commonly used metrics query language functions

Rate and Increase

Mostly used on counter metrics. Here is the data sample we are going to use to clarify what those functions do:

nginx_http_requests_total  133  1740234001  # 2025-02-22T14:20:01Z
nginx_http_requests_total  133  1740234016  # 2025-02-22T14:20:16Z
nginx_http_requests_total  854  1740234031  # 2025-02-22T14:20:31Z
nginx_http_requests_total  854  1740234046  # 2025-02-22T14:20:46Z
nginx_http_requests_total  1671 1740234061  # 2025-02-22T14:21:01Z

This is the data returned by the nginx_http_requests_total query over the time range from 2025-02-22T14:20:01Z to 2025-02-22T14:21:01Z (1 minute).

If we run increase(nginx_http_requests_total[1m]), we calculate the number of new requests over the last minute, between a=2025-02-22T14:20:01Z and b=2025-02-22T14:21:01Z:

  • (value at b) - (value at a) = 1671 - 133 = 1538 new requests.

If we run rate(nginx_http_requests_total[1m]) on that same time range, we calculate the average per-second rate at which requests increased over the last minute (requests / second):

  • [(value at b) - (value at a)] / (calculation time range in brackets)
  • [1671 - 133] / 60 = 1538 / 60 ≈ 25.63 requests / second
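The two calculations above can be reproduced with a short Python sketch. It implements only the simplified formulas from this section: simple_increase and simple_rate are hypothetical helpers, timestamps are written as seconds elapsed since the first sample, and real query engines may also extrapolate and handle counter resets, so their results can differ slightly.

```python
# (seconds since the first sample, counter value) for nginx_http_requests_total
samples = [(0, 133), (15, 133), (30, 854), (45, 854), (60, 1671)]


def simple_increase(samples):
    """Last value minus first value over the window."""
    return samples[-1][1] - samples[0][1]


def simple_rate(samples, window_seconds):
    """Average per-second increase over the window."""
    return simple_increase(samples) / window_seconds


print(simple_increase(samples))                # 1538
print(round(simple_rate(samples, 60), 2))      # 25.63
```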

Aggregate and rollup functions