Prometheus

Types of logs

  • Transaction logs
  • Request logs
  • Application logs
  • Debug logs

Using Expression browser

Metrics:

  • up
  • Gauges: process_resident_memory_bytes, metric type gauge. For a gauge, its current absolute value is what matters from a monitoring point of view.
  • Counters: prometheus_tsdb_head_samples_appended_total, metric type counter. The number of samples Prometheus has ingested.
  • Rate: rate(prometheus_tsdb_head_samples_appended_total[1m]) computes the per-second rate of increase, averaged over the last minute. The rate function automatically handles counters resetting due to processes restarting and samples not being exactly aligned. This can lead to rates on integer counters returning non-integer results, but the results are correct on average.

Running the Node Exporter

Exposes kernel- and machine-level metrics on Unix systems. Provides all the standard metrics such as CPU, memory, disk space, disk I/O, and network bandwidth.

  • Label matchers: use labels to filter metrics by job name, e.g. process_resident_memory_bytes{job="node"}
  • rate(node_network_receive_bytes_total[1m]) gives the rate at which bytes are being received by the network interfaces.

Alerting

global:
  scrape_interval: 10s
  evaluation_interval: 10s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
rule_files:
  - rules.yml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
    - targets: ['localhost:9100']
  - job_name: 'example'
    static_configs:
    - targets:
      - localhost:8000

rules.yml:

groups:
 - name: example
   rules:
   - alert: InstanceDown
     expr: up == 0
     for: 1m

alertmanager.yml:

global:
  smtp_smarthost: 'localhost:25'
  smtp_from: 'youraddress@example.org'
route:
  receiver: example-email
receivers:
- name: example-email
  email_configs:
  - to: 'youraddress@example.org'

Instrumentation

Define metric variables at file (module) level to avoid registering the same metric name more than once in the application, as in the sketch below.
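A minimal sketch using the Python client library (the hello_worlds_total name and port 8000 match the examples used in these notes; handle_request is just a stand-in for real request handling):

import time
from prometheus_client import Counter, start_http_server

# Defined once at module (file) level, so the metric name is only
# registered a single time for the whole application.
REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.')

def handle_request():
    REQUESTS.inc()  # one increment per request served

if __name__ == '__main__':
    start_http_server(8000)  # exposes /metrics on localhost:8000
    while True:
        handle_request()
        time.sleep(1)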

global:
  scrape_interval: 10s
scrape_configs:
 - job_name: example
   static_configs:
    - targets:
       - localhost:8000


  • python_info
  • rate(hello_world_exceptions_total[1m]) / rate(hello_worlds_total[1m]) to compute the ratio of requests that raise an exception (see the sketch after the next bullet)

  • Prometheus uses 64-bit floating-point numbers for values.
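Where those two counters might come from: a sketch using the Python client's count_exceptions helper (names follow the hello_world examples above; the body of handle_request is a stand-in):

from prometheus_client import Counter

REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.')
EXCEPTIONS = Counter('hello_world_exceptions_total',
                     'Exceptions serving Hello World.')

def handle_request():
    REQUESTS.inc()
    # count_exceptions() increments the exception counter and re-raises
    # if the wrapped block throws.
    with EXCEPTIONS.count_exceptions():
        pass  # real request handling goes here

The division itself is done in PromQL at query time, not in the application.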

Examples of gauge:

  • the number of items in a queue
  • memory usage of a cache
  • number of active threads
  • the last time a record was processed
  • average requests per second in the last minute

time() minus such a timestamp gauge tells you how many seconds it has been since the last request.
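A sketch of such a timestamp gauge with the Python client (the hello_world_last_time_seconds name follows the hello_world examples above):

import time
from prometheus_client import Gauge

# Unix timestamp of the most recent request, exposed as a gauge.
LAST = Gauge('hello_world_last_time_seconds',
             'The last time a Hello World was served.')

def handle_request():
    LAST.set(time.time())  # record when this request happened

In PromQL, time() - hello_world_last_time_seconds then gives the number of seconds since the last request.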

Writing exporters

  • Metric names for applications should generally be prefixed by the exporter name, e.g. haproxy_up.
  • Metrics must use base units (e.g. seconds, bytes); leave converting them to something more readable to graphing tools.
  • Expose ratios, not percentages. Even better, expose a counter for each of the two components of the ratio.
  • Prometheus metrics and label names are written in snake_case
  • Exposed metrics should not contain colons, these are reserved for user defined recording rules to use when aggregating.
  • Only [a-zA-Z0-9:_] are valid in metric names, any other characters should be sanitized to an underscore.
  • The _sum, _count, _bucket and _total suffixes are used by Summaries, Histograms and Counters. Unless you’re producing one of those, avoid these suffixes.
  • _total is a convention for counters, you should use it if you’re using the COUNTER type.
  • The process_ and scrape_ prefixes are reserved
  • It’s good practice to also have an exporter-centric metric, e.g. jmx_scrape_duration_seconds, saying how long the specific exporter took to do its thing
  • For process stats where you have access to the PID, both Go and Python offer collectors that’ll handle this for you
  • When you have a successful request count and a failed request count, the best way to expose this is as one metric for total requests and another metric for failed requests. This makes it easy to calculate the failure ratio. Do not use one metric with a failed or success label. Similarly, with hit or miss for caches, it's better to have one metric for total and another for hits (see the sketch after this list).
  • A HELP string with the original name can provide most of the same benefits as using the original names.
  • Avoid type as a label name; it's too generic and often meaningless. You should also try where possible to avoid names that are likely to clash with target labels, such as region, zone, cluster, availability_zone, az, datacenter, dc, owner, customer, stage, service, environment, and env.
  • average latency: rate(hello_world_latency_seconds_sum[1m]) / rate(hello_world_latency_seconds_count[1m])
  • RED: request rate, error rate, latency (duration)
  • curl http://localhost:8000/metrics | promtool check-metrics to check the exposed metrics for common problems
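A sketch of the total-plus-failed pattern with the Python client (the myapp_requests_total and myapp_requests_failed_total names are only illustrative):

from prometheus_client import Counter

# One counter for all requests and one for failures; the failure ratio is
# then rate(myapp_requests_failed_total[1m]) / rate(myapp_requests_total[1m]).
REQUESTS = Counter('myapp_requests_total', 'Requests processed.')
FAILURES = Counter('myapp_requests_failed_total', 'Requests that failed.')

def handle_request():
    REQUESTS.inc()
    try:
        pass  # real request handling goes here
    except Exception:
        FAILURES.inc()
        raise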

You may have noticed that the example counter metrics all ended with _total, while there is no such suffix on gauges. This is a convention within Prometheus that makes it easier to identify what type of metric you are working with. In addition to _total, the _count, _sum, and _bucket suffixes also have other meanings and should not be used as suffixes in your metric names to avoid confusion.

It is also strongly recommended that you include the unit of your metric at the end of its name. For example, a counter for bytes processed might be myapp_requests_processed_bytes_total. Even though the hello_world_latency_seconds metric uses seconds as its unit in line with Prometheus conventions, this does not mean it only has second precision: Prometheus uses 64-bit floating-point values that can handle metrics ranging from days to nanoseconds.

As summaries are usually used to track latency, there is a time context manager and function decorator that makes this simpler (Example 3-10 in the book). It also handles exceptions and time going backwards for you.

Summary metrics may also include quantiles. A summary will provide the average latency, but what if you want a quantile? Quantiles tell you that a certain proportion of events had a size below a given value. For example, the 0.95 quantile being 300 ms means that 95% of requests took less than 300 ms. Quantiles are useful when reasoning about actual end-user experience: if a user's browser makes 20 concurrent requests to your application, it is the slowest of them that determines the user-visible latency, and the 95th percentile captures that.

The instrumentation for histograms is the same as for summaries. The default buckets cover a range of latencies from 1 ms to 10 s, which is intended to capture the typical range of latencies for a web application, but you can override them and provide your own buckets when defining metrics. This might be done if the defaults are not suitable for your use case, or to add an explicit bucket for latency quantiles mentioned in your Service-Level Agreements (SLAs).
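A sketch of the time decorator and of a histogram with custom buckets, using the Python client (the hello_world_latency_seconds name follows the earlier examples; the histogram name and bucket values are only illustrative):

import time
from prometheus_client import Summary, Histogram

# Summary: exposes hello_world_latency_seconds_sum and _count, from which
# average latency can be computed with rate().
LATENCY = Summary('hello_world_latency_seconds',
                  'Time for a request Hello World.')

# Histogram: additionally exposes _bucket time series, from which quantiles
# can be estimated; the buckets here override the 1 ms to 10 s defaults.
LATENCY_HIST = Histogram('hello_world_latency_hist_seconds',
                         'Time for a request Hello World.',
                         buckets=[0.001, 0.01, 0.1, 1, 10])

@LATENCY.time()        # times the call and observes the duration
@LATENCY_HIST.time()
def handle_request():
    time.sleep(0.1)    # stand-in for real work

histogram_quantile(0.95, rate(hello_world_latency_hist_seconds_bucket[1m])) would then estimate the 0.95 quantile from the histogram.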