Getting started | Prometheus

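Downloading and running Prometheus
Configuring Prometheus to monitor itself
Starting Prometheus
Using the expression browser
Using the graphing interface
Starting up some sample targets
Configuring Prometheus to monitor the sample targets
Configure rules for aggregating scraped data into new time series
Reloading configuration
Shutting down your instance gracefully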

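This guide is a "Hello World"-style tutorial that shows how to install, configure, and use a simple Prometheus instance. You will download and run Prometheus locally, configure it to scrape itself and an example application, and then work with queries, rules, and graphs to use the collected time series data.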

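Downloading and running Prometheus

Download the latest release of Prometheus for your platform, then extract it and change into the extracted directory: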

tar xvfz prometheus-*.tar.gz
cd prometheus-*

Before starting Prometheus, let's configure it.

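Configuring Prometheus to monitor itself

Prometheus collects metrics from targets by scraping their metrics HTTP endpoints. Because Prometheus exposes data about itself in the same way, it can also scrape and monitor its own health.

A Prometheus server that collects only data about itself is not very useful, but it is a good starting example. Save the following basic Prometheus configuration as a file named prometheus.yml: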

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

For a complete specification of configuration options, see the configuration documentation.
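To make the override in the configuration above concrete, here is a minimal Python sketch. It is not Prometheus code: it mirrors the YAML as plain dictionaries and resolves which interval a job would use. The helper name effective_scrape_interval is hypothetical, and the sketch assumes the global section always sets scrape_interval.

```python
# Hypothetical helper: shows that a per-job scrape_interval takes
# precedence over the global default. Not part of Prometheus itself.

CONFIG = {
    "global": {"scrape_interval": "15s"},
    "scrape_configs": [
        {
            "job_name": "prometheus",
            "scrape_interval": "5s",
            "static_configs": [{"targets": ["localhost:9090"]}],
        },
    ],
}


def effective_scrape_interval(config: dict, job_name: str) -> str:
    """Return the job's own scrape_interval, or the global one if unset."""
    for job in config["scrape_configs"]:
        if job["job_name"] == job_name:
            return job.get("scrape_interval", config["global"]["scrape_interval"])
    raise KeyError(f"no job named {job_name!r}")


print(effective_scrape_interval(CONFIG, "prometheus"))  # prints 5s
```

Removing the job-level scrape_interval from the dictionary would make the helper fall back to the global 15s value.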

Starting Prometheus

To start Prometheus with your newly created configuration file, change to the directory containing the Prometheus binary and run:

# Start Prometheus.
# By default, Prometheus stores its database in ./data (flag --storage.tsdb.path).
./prometheus --config.file=prometheus.yml

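Prometheus should start up. You should also be able to browse to its status page at localhost:9090. Give it a few seconds to collect data about itself from its own HTTP metrics endpoint.

You can also verify that Prometheus is serving metrics about itself by navigating to its metrics endpoint: localhost:9090/metrics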
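The metrics endpoint returns plain text with one sample per line. As a rough illustration, the following Python sketch parses that text into tuples. It is a simplified reader, not a full parser: it assumes label values contain no escaped quotes, and the sample text and its values are made up for the example. The function names parse_metrics and fetch_metrics are hypothetical.

```python
import re
from urllib.request import urlopen

# metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

# Made-up excerpt in the style of a /metrics response.
SAMPLE_TEXT = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
prometheus_target_interval_length_seconds{interval="5s",quantile="0.99"} 5.0001
"""


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified reader cannot handle
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://localhost:9090/metrics"):
    """Fetch and parse a live endpoint (requires a running Prometheus)."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


for sample in parse_metrics(SAMPLE_TEXT):
    print(sample)
```

With a local instance running, calling fetch_metrics() would apply the same parsing to the real endpoint.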

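Using the expression browser

Let us explore the data that Prometheus has collected about itself. To use Prometheus's built-in expression browser, navigate to http://localhost:9090/graph and choose the "Table" view within the "Graph" tab.

As you can see from localhost:9090/metrics, one metric that Prometheus exports about itself is named prometheus_target_interval_length_seconds (the actual amount of time between target scrapes). Enter the following into the expression console and then click "Execute":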

prometheus_target_interval_length_seconds

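This should return a number of different time series (along with the latest value recorded for each), each with the metric name prometheus_target_interval_length_seconds but with different labels. These labels designate different latency percentiles and target group intervals.

If we are interested only in 99th percentile latencies, we could use this query: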

prometheus_target_interval_length_seconds{quantile="0.99"}

To count the number of returned time series, you could write:

count(prometheus_target_interval_length_seconds)

For more about the expression language, see the expression language documentation.
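The label selector and count() queries above can be pictured as a filter over labelled series. The Python sketch below is only an analogy, not how Prometheus evaluates queries. The series values are made up, and the select helper is hypothetical; the one real detail it borrows is that Prometheus stores the metric name in the internal __name__ label.

```python
# Illustrative series; the values are invented for the example.
METRIC = "prometheus_target_interval_length_seconds"
SERIES = [
    ({"__name__": METRIC, "interval": "5s", "quantile": "0.5"}, 4.9998),
    ({"__name__": METRIC, "interval": "5s", "quantile": "0.9"}, 5.0002),
    ({"__name__": METRIC, "interval": "5s", "quantile": "0.99"}, 5.0011),
    ({"__name__": "go_goroutines"}, 42.0),
]


def select(series, name, **matchers):
    """Keep series whose metric name and labels equal the given matchers."""
    return [
        (labels, value)
        for labels, value in series
        if labels.get("__name__") == name
        and all(labels.get(key) == want for key, want in matchers.items())
    ]


# Analogue of: prometheus_target_interval_length_seconds{quantile="0.99"}
p99 = select(SERIES, METRIC, quantile="0.99")
print(p99)

# Analogue of: count(prometheus_target_interval_length_seconds)
print(len(select(SERIES, METRIC)))  # prints 3
```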

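Using the graphing interface

To graph expressions, navigate to http://localhost:9090/graph and use the "Graph" tab.

For example, enter the following expression to graph the per-second rate of chunks being created in the self-scraped Prometheus: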

rate(prometheus_tsdb_head_chunks_created_total[1m])

Experiment with the graph range parameters and other settings.
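To build intuition for what rate() computes over a counter, here is a simplified Python sketch. It assumes samples arrive as (timestamp_seconds, counter_value) pairs inside the window and treats any drop in value as a counter reset. The real rate() function also extrapolates towards the window boundaries, which this sketch deliberately leaves out, so its numbers can differ slightly from what Prometheus reports. The name simple_rate and the sample data are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase between the first and last sample in a window.

    `samples` is a list of (timestamp_seconds, counter_value) pairs in
    time order. Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so assume it restarted from zero.
            increase += current
    return increase / elapsed


# Made-up counter samples scraped every 15 seconds over one minute.
window = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
print(simple_rate(window))  # prints 2.0
```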

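Starting up some sample targets

Let's add more targets for Prometheus to scrape.

The Node Exporter is used as an example target. For more information on using it, see its usage instructions.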

tar -xzvf node_exporter-*.*.tar.gz
cd node_exporter-*.*

# Start 3 example targets in separate terminals:
./node_exporter --web.listen-address 127.0.0.1:8080
./node_exporter --web.listen-address 127.0.0.1:8081
./node_exporter --web.listen-address 127.0.0.1:8082

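You should now have example targets listening on http://localhost:8080/metrics, http://localhost:8081/metrics, and http://localhost:8082/metrics.

Configuring Prometheus to monitor the sample targets

Now we will configure Prometheus to scrape these new targets. Let's group all three endpoints into one job called node. We will imagine that the first two endpoints are production targets, while the third one represents a canary instance. To model this in Prometheus, we can add several groups of endpoints to a single job and attach extra labels to each group of targets. In this example, we will add the group="production" label to the first group of targets and group="canary" to the second.

To achieve this, add the following job definition to the scrape_configs section of your prometheus.yml (append the job entry to the existing list rather than repeating the scrape_configs key), then restart your Prometheus instance: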

scrape_configs:
  - job_name:       'node'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'

Go to the expression browser and verify that Prometheus now has information about the time series that these example endpoints expose, such as node_cpu_seconds_total.
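The effect of the grouped static_configs can be sketched in Python. The snippet below mirrors the node job as a dictionary and lists the labels each target would carry: the job label from job_name, the instance label (which defaults to the target address), and the extra group label. The helper target_labels is hypothetical and ignores relabeling and other configuration options.

```python
# Mirror of the `node` job from prometheus.yml as plain Python data.
NODE_JOB = {
    "job_name": "node",
    "static_configs": [
        {
            "targets": ["localhost:8080", "localhost:8081"],
            "labels": {"group": "production"},
        },
        {
            "targets": ["localhost:8082"],
            "labels": {"group": "canary"},
        },
    ],
}


def target_labels(job):
    """Return one label set per target, combining job, instance and group labels."""
    result = []
    for group in job["static_configs"]:
        for target in group["targets"]:
            labels = {"job": job["job_name"], "instance": target}
            labels.update(group.get("labels", {}))
            result.append(labels)
    return result


for labels in target_labels(NODE_JOB):
    print(labels)
```

In the expression browser, the same split can then be used in a selector such as node_cpu_seconds_total{group="canary"}.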

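Configure rules for aggregating scraped data into new time series

Though not a problem in our example, queries that aggregate over thousands of time series can get slow when computed ad hoc. To make this more efficient, Prometheus can prerecord expressions into new persisted time series via configured recording rules. Let's say we are interested in recording the per-second rate of CPU time (node_cpu_seconds_total) averaged over all CPUs per instance (but preserving the job, instance, and mode dimensions), measured over a window of 5 minutes. We could write this as: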

avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

Try graphing this expression.
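The avg by (job, instance, mode) aggregation groups series that share those three labels and averages away everything else, here the per-CPU dimension. A minimal Python sketch of that grouping follows. The per-CPU rates are made-up numbers and avg_by is a hypothetical helper, not a Prometheus API.

```python
from collections import defaultdict

# Made-up per-CPU rates, standing in for rate(node_cpu_seconds_total[5m]).
RATES = [
    ({"job": "node", "instance": "localhost:8080", "mode": "idle", "cpu": "0"}, 0.75),
    ({"job": "node", "instance": "localhost:8080", "mode": "idle", "cpu": "1"}, 0.25),
    ({"job": "node", "instance": "localhost:8080", "mode": "user", "cpu": "0"}, 0.125),
    ({"job": "node", "instance": "localhost:8080", "mode": "user", "cpu": "1"}, 0.375),
]


def avg_by(samples, keys):
    """Average sample values per distinct combination of the given labels."""
    groups = defaultdict(list)
    for labels, value in samples:
        group_key = tuple(labels.get(key, "") for key in keys)
        groups[group_key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


averaged = avg_by(RATES, ("job", "instance", "mode"))
for group_key, value in averaged.items():
    print(group_key, value)
```

The cpu label disappears from the result because it is not in the grouping keys, which is the same reason the recorded series carries only job, instance, and mode.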

To record the time series resulting from this expression into a new metric called job_instance_mode:node_cpu_seconds:avg_rate5m, create a file with the following recording rule and save it as prometheus.rules.yml:

groups:
- name: cpu-node
  rules:
  - record: job_instance_mode:node_cpu_seconds:avg_rate5m
    expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

To make Prometheus pick up this new rule, add a rule_files statement to your prometheus.yml. The config should now look like this:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

scrape_configs:
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name:       'node'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'

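Restart Prometheus with the new configuration and verify that a new time series with the metric name job_instance_mode:node_cpu_seconds:avg_rate5m is now available by querying it through the expression browser or graphing it.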
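The same check can be scripted against the Prometheus HTTP API, whose /api/v1/query endpoint evaluates an instant query passed in the query parameter. The Python sketch below builds that URL and, given a running local instance, fetches the result. The function names are hypothetical, and the base URL assumes the default localhost:9090 address used throughout this guide.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

RECORDED_METRIC = "job_instance_mode:node_cpu_seconds:avg_rate5m"


def instant_query_url(expr, base="http://localhost:9090"):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def run_instant_query(expr, base="http://localhost:9090"):
    """Run the query against a live server and return the result list."""
    with urlopen(instant_query_url(expr, base), timeout=5) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


print(instant_query_url(RECORDED_METRIC))
```

An empty result list from run_instant_query(RECORDED_METRIC) would suggest the rule has not been loaded or evaluated yet.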

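Reloading configuration

As mentioned in the configuration documentation, a Prometheus instance can reload its configuration without restarting the process by using the SIGHUP signal. If you are running on Linux, this can be done with kill -s SIGHUP <PID>, replacing <PID> with your Prometheus process ID.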
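For scripted setups, the kill command above has a direct Python equivalent. This is a minimal sketch for Unix-like systems only; the function name reload_prometheus is hypothetical, and you need to supply the process ID yourself.

```python
import os
import signal


def reload_prometheus(pid: int) -> None:
    """Send SIGHUP to the process with the given ID (Unix only).

    Equivalent to running: kill -s SIGHUP <PID>
    """
    os.kill(pid, signal.SIGHUP)
```

Calling reload_prometheus(12345) would signal process 12345; the number here is a placeholder, not a real Prometheus PID.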

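Shutting down your instance gracefully

While Prometheus does have recovery mechanisms in case of abrupt process failure, it is recommended to use signals or interrupts for a clean shutdown of a Prometheus instance. On Linux, this can be done by sending the SIGTERM or SIGINT signal to the Prometheus process. For example, you can use kill -s <SIGNAL> <PID>, replacing <SIGNAL> with the signal name and <PID> with the Prometheus process ID. Alternatively, you can press the interrupt character at the controlling terminal, which by default is ^C (Control-C).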
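If Prometheus is launched from a script, the same clean shutdown can be done programmatically. The Python sketch below assumes the process was started with subprocess.Popen; it sends SIGTERM and waits for the process to exit. The function name stop_gracefully and the 30-second timeout are illustrative choices, not values from the Prometheus documentation.

```python
import signal
import subprocess


def stop_gracefully(proc: subprocess.Popen, timeout: float = 30.0) -> int:
    """Send SIGTERM to a child process and wait for it to exit.

    Returns the exit code. Raises subprocess.TimeoutExpired if the
    process is still running after `timeout` seconds.
    """
    proc.send_signal(signal.SIGTERM)
    return proc.wait(timeout=timeout)
```

A typical use would be to start the server with subprocess.Popen(["./prometheus", "--config.file=prometheus.yml"]) and pass the returned object to stop_gracefully when the script is done.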
