Original page: https://www.cockroachlabs.com/docs/stable/monitor-cockroachdb-with-prometheus.html


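CockroachDB generates detailed time-series metrics for every node in a cluster. This section shows how to pull these metrics into Prometheus, an open-source tool for storing, aggregating, and querying time-series data. It also shows how to connect Grafana and Alertmanager to Prometheus for flexible data visualization and notifications.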

Tip: For more monitoring options, see Monitoring and Alerting.

Preface

Step 1: Install Prometheus

prometheus --version

prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:21:40
  go version:       go1.10

Step 2: Configure Prometheus

Get the Prometheus configuration file for CockroachDB

wget https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/prometheus.yml -O prometheus.yml

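If you inspect the configuration file, you will see that it is set up to scrape time-series metrics from a single, insecure local node every 10 seconds.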
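The settings described above can be sketched as a minimal configuration. This is an illustration only, not a verbatim copy of the downloaded file: the file name prometheus.sketch.yml is hypothetical, and the job name and port 8080 are assumptions (8080 is CockroachDB's default HTTP port). The /_status/vars path is the endpoint where a CockroachDB node exposes metrics in Prometheus format.

```shell
# Write an illustrative config to a separate, hypothetical file name so the
# downloaded prometheus.yml is left untouched.
cat > prometheus.sketch.yml <<'EOF'
global:
  scrape_interval: 10s              # scrape every 10 seconds

scrape_configs:
  - job_name: 'cockroachdb'         # assumed job name
    metrics_path: '/_status/vars'   # CockroachDB's Prometheus-format endpoint
    scheme: 'http'                  # insecure (non-TLS) node
    static_configs:
      - targets: ['localhost:8080'] # single local node; port is an assumption
EOF

# Print the lines that correspond to the settings described above.
grep -E 'scrape_interval|scheme|targets' prometheus.sketch.yml
```

Compare this against the file you actually downloaded; the real file may contain additional sections such as rule file and alerting settings.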

Customize the configuration file

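Scenario | Required change
Local multi-node cluster | Add each node's address to the targets setting in the form 'localhost:<http-port>'.
Production cluster | Add each node's address to the targets setting in the form '<hostname>:<http-port>'. Also make sure the machines' network configuration allows TCP connections to the ports of the monitored endpoints.
Secure cluster | Replace scheme: 'http' with scheme: 'https'.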
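For a cluster with several nodes, the targets line can be generated from a list of node addresses rather than typed by hand. The following is a small sketch; the node addresses and the variable names NODES, targets, and targets_line are hypothetical and should be replaced with your own values.

```shell
# Hypothetical node addresses; replace with your own <hostname>:<http-port> values.
NODES="node1.example.com:8080 node2.example.com:8080 node3.example.com:8080"

targets=""
for node in $NODES; do
  # Quote each address; add a comma separator after the first entry.
  targets="${targets:+$targets, }'$node'"
done

# Build the line to paste under static_configs in prometheus.yml.
targets_line="      - targets: [$targets]"
echo "$targets_line"
```

The printed line can then be copied into the static_configs section of the configuration file.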

Get the rules files

Create a rules directory, then download the aggregation rules and the alerting rules into it.

mkdir rules
wget -P rules https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/rules/aggregation.rules.yml
wget -P rules https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/rules/alerts.rules.yml

Step 3: Start Prometheus

prometheus --config.file=prometheus.yml

INFO[0000] Starting prometheus (version=1.4.1, branch=master, revision=2a89e8733f240d3cd57a6520b52c36ac4744ce12)  source=main.go:77
INFO[0000] Build context (go=go1.7.3, user=root@e685d23d8809, date=20161128-10:02:41)  source=main.go:78
INFO[0000] Loading configuration file prometheus.yml     source=main.go:250
INFO[0000] Loading series map and head chunks...         source=storage.go:354
INFO[0000] 0 series loaded.                              source=storage.go:359
INFO[0000] Listening on :9090                            source=web.go:248
INFO[0000] Starting target manager...                    source=targetmanager.go:63

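Step 4: Send notifications with Alertmanager

Active monitoring helps you spot problems early, but it is just as important to be notified promptly when an event requires intervention or investigation. Step 2 downloaded the CockroachDB alerting rules; next, download, configure, and start Alertmanager.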

alertmanager --version

alertmanager, version 0.15.0-rc.1 (branch: HEAD, revision: acb111e812530bec1ac6d908bc14725793e07cf3)
  build user:       root@f278953f13ef
  build date:       20180323-13:07:06
  go version:       go1.10

alertmanager --config.file=simple.yml
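The contents of the Alertmanager configuration file are not shown here. As a rough sketch, a minimal configuration needs a route that names a default receiver and a matching entry in receivers. The file name alertmanager.sketch.yml, the receiver name, and the webhook URL below are all hypothetical placeholders.

```shell
# Write an illustrative Alertmanager config to a hypothetical file name.
cat > alertmanager.sketch.yml <<'EOF'
route:
  receiver: 'default-webhook'          # every alert goes to this receiver
receivers:
  - name: 'default-webhook'
    webhook_configs:
      - url: 'http://localhost:5001/'  # placeholder endpoint that receives alerts
EOF

# The receiver name should appear twice: once in the route, once in receivers.
grep -c 'default-webhook' alertmanager.sketch.yml
```

To try it, pass the file with the same flag used above: alertmanager --config.file=alertmanager.sketch.yml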

Step 5: Visualize metrics in Grafana

Prometheus can draw simple graphs of metrics, but Grafana is a more powerful visualization tool that integrates easily with Prometheus.

Setting | Value
Name | Prometheus
Default | True
Type | Prometheus
Url | http://<hostname or IP address of the Prometheus server>:9090
Access | Direct

# runtime dashboard: node status, including uptime, memory, and cpu.
wget https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/runtime.json

# storage dashboard: storage availability.
wget https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/storage.json

# sql dashboard: sql queries/transactions.
wget https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/sql.json

# replicas dashboard: replica information and operations.
wget https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/grafana-dashboards/replicas.json