2017-08-01

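Moved from MSYS2 to Bash on Windows + Cmder, then switched to Bash on Windows + wsltty

Bash on Windows came out of beta, so I moved from MSYS2 to a BoW + Cmder setup. I did move, but copy and paste were bound to Ctrl-C and Ctrl-V, which are absurd shortcut keys for a terminal. That part can be rebound, but Cmder was also slow to begin with, and the window would go completely white when resized, so I switched to mintty/wsltty.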

github.com

wsltty is mintty for WSL. It has no tabs, but I had been using mintty under MSYS2, so the controls are familiar. The only settings I changed were the font, to Ricty Diminished Discord, and $TERM, from xterm to xterm-256color, because the bash prompt was not getting colored.

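I don't particularly use a multiplexer either. When I use one, I have a bad habit of opening panes for no reason, and it wastes what few brain resources I have.

So it is only day one since the switch, but I am already quite satisfied with this.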

2017-07-30

Integrating Consul with Prometheus

Prometheus Consul

In the previous posts I wrote about monitoring with Prometheus and alert notifications with Alertmanager.

tyru.hatenablog.com

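But as the config files show, adding every new monitoring target to the config file by hand is a real chore, and I want to automate it. Fortunately, Prometheus has plenty of mechanisms for fetching monitoring targets from other systems. This time I integrate it with Consul so that monitoring targets are added automatically whenever a node is added.

It seems that a server and an agent cannot run on the same node, so I install the Consul server and the Consul agent on separate nodes. Continuing from last time, the layout looks like this.

Monitoring server (host name: promhost)
- Prometheus server, Alertmanager, node_exporter, Consul server
Monitored server (host name: targethost)
- node_exporter, Consul agent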

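What is Consul?

There are plenty of explanations out there, so see those (I just can't give a detailed one myself). The following is an excellent article series.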

gihyo.jp

The official Getting Started is also easy to follow.

prometheus.io

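As I understand it:

- Server/client architecture
- Has a KVS that stores information such as:
  - the list of nodes connected to the server
  - which services are running where (service discovery)
- Has health checking for service detection, but no alerting

Consul keeps the node list and other information obtained through service discovery in its KVS, so the idea is to hand that to Prometheus and let Prometheus do the alerting (in practice Prometheus pulls the information, so the direction is reversed).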

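Installing Consul

Unlike Prometheus, Consul binaries are not on the GitHub releases page (source code only). Download them from the download page on the official site. They are distributed as zip files, so you may need to install unzip to extract them on Linux.

$ sudo yum install unzip  # install it if missing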
$ curl -LO https://releases.hashicorp.com/consul/0.9.0/consul_0.9.0_linux_amd64.zip
$ unzip consul_0.9.0_linux_amd64.zip
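Since the zip comes straight from a download site, it may be worth checking it against the published checksums before installing. As far as I know, HashiCorp puts a consul_0.9.0_SHA256SUMS file next to the zip on releases.hashicorp.com. The following is a minimal sketch under that assumption; the function name verify_sha256 is made up for illustration, and it relies on sha256sum and awk being available.

```shell
# Hypothetical helper: check one file against a SHA256SUMS-style list,
# i.e. lines of the form "<hash>  <file name>".
# Returns 0 only if the file is listed and the hashes match.
verify_sha256() {
  file="$1"   # file to verify, e.g. consul_0.9.0_linux_amd64.zip
  sums="$2"   # checksum list, e.g. consul_0.9.0_SHA256SUMS
  expected="$(awk -v n="$(basename "$file")" '$2 == n { print $1 }' "$sums")"
  actual="$(sha256sum "$file" | awk '{ print $1 }')"
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage (assumed URL layout, not taken from the steps above):
#   curl -LO https://releases.hashicorp.com/consul/0.9.0/consul_0.9.0_SHA256SUMS
#   verify_sha256 consul_0.9.0_linux_amd64.zip consul_0.9.0_SHA256SUMS && echo OK
```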

Extracting it puts a single binary named consul in the current directory. Install it as /opt/consul/consul.

$ sudo mkdir /opt/consul
$ sudo mv consul /opt/consul/
$ sudo chmod 755 /opt/consul
$ sudo chown -R root:root /opt/consul

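That's it for installing the binary. This single binary serves as both server and agent. As the easy install suggests, Consul is also written in Go.

Next, create the config files so that it can be started by systemd.

For the Consul agent

These files go on the agent node. I still don't understand the difference between an agent and a client... Anyway, first create the directory for config files.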

$ sudo mkdir /etc/consul.d

Next, create the config files.

/etc/systemd/system/consul-agent.service

[Unit]
Description=Consul Agent
After=network.target

[Service]
Type=simple
EnvironmentFile=-/etc/default/consul-agent
ExecStart=/opt/consul/consul agent $OPTIONS
PrivateTmp=true

[Install]
WantedBy=multi-user.target

/etc/default/consul-agent

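-data-dir is the data directory. -node is a name that must be unique within the Consul cluster; I set it to the host name (I'd like to generate this automatically later when installing on each agent). -config-dir is the directory for config files. -join is the IP address or host name of the Consul server.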

OPTIONS="-data-dir=/var/lib/consul-agent -node=targethost -config-dir=/etc/consul.d/ -join=promhost"

By the way, when I set -node to $(hostname), the node was registered under the literal name $(hostname). So the file given to systemd's EnvironmentFile is not a shell script... (I think this would have worked with SysVinit?)

OPTIONS="-data-dir=/var/lib/consul-agent -node=$(hostname) -config-dir=/etc/consul.d/ -join=promhost"

And this is what I got.

$ sudo /opt/consul/consul members
Node         Address              Status  Type    Build  Protocol  DC
promhost     xxx.xxx.xxx.xxx:8301  alive   server  0.9.0  2         dc1
$(hostname)  zzz.zzz.zzz.zzz:8301  alive   client  0.9.0  2         dc1

After I fixed the node name, it looked like this. The Status is failed, but the entry has not been removed from the list.

Node         Address              Status  Type    Build  Protocol  DC
$(hostname)  xxx.xxx.xxx.xxx:8301  failed  client  0.9.0  2         dc1
promhost     yyy.yyy.yyy.yyy:8301  alive   server  0.9.0  2         dc1
targethost   xxx.xxx.xxx.xxx:8301  alive   client  0.9.0  2         dc1

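In this case, running consul force-leave puts the node into the left state (the linked page explains the difference between failed and left). It still isn't removed from the list, though.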

$ sudo /opt/consul/consul force-leave '$(hostname)'
$ sudo /opt/consul/consul members
Node         Address              Status  Type    Build  Protocol  DC
$(hostname)  xxx.xxx.xxx.xxx:8301  left    client  0.9.0  2         dc1
promhost     yyy.yyy.yyy.yyy:8301  alive   server  0.9.0  2         dc1
targethost   xxx.xxx.xxx.xxx:8301  alive   client  0.9.0  2         dc1

According to the official site:

Q: Are failed or left nodes ever removed?

To prevent an accumulation of dead nodes (nodes in either failed or left states), Consul will automatically remove dead nodes out of the catalog. This process is called reaping. This is currently done on a configurable interval of 72 hours. Reaping is similar to leaving, causing all associated services to be deregistered. Changing the reap interval for aesthetic reasons to trim the number of failed or left nodes is not advised (nodes in the failed or left state do not cause any additional burden on Consul).

Frequently Asked Questions - Consul by HashiCorp

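A process called reaping, which removes nodes in the failed or left state from the list, runs every 72 hours. The FAQ even says that changing this interval just because the leftover entries look untidy is not advised. So I decided not to worry about it. However, $(hostname) was gone after I restarted consul-server, so perhaps reaping also runs on restart.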
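Going back to the $(hostname) problem: since systemd does not run EnvironmentFile through a shell, one way to still get a per-host node name is to render the file once at install time, with the host name already expanded. The following is a minimal sketch assuming the same paths and options as above; the function name render_consul_agent_env is made up for illustration.

```shell
# Hypothetical helper: print the contents of /etc/default/consul-agent
# with the node name already expanded. systemd reads EnvironmentFile as
# plain KEY=VALUE lines and does not evaluate $(...), so the expansion
# has to happen before the file is written.
render_consul_agent_env() {
  node_name="$1"   # e.g. the output of `hostname`
  join_host="$2"   # Consul server to join, e.g. promhost
  printf 'OPTIONS="-data-dir=/var/lib/consul-agent -node=%s -config-dir=/etc/consul.d/ -join=%s"\n' \
    "$node_name" "$join_host"
}

# Usage (run once on each agent node):
#   render_consul_agent_env "$(hostname)" promhost | sudo tee /etc/default/consul-agent
```

Here the shell that runs the install step does the expansion, so systemd only ever sees a literal node name.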

Starting with systemd

$ sudo systemctl daemon-reload
$ sudo systemctl enable consul-agent  # skipped on my machine because this is only a trial
$ sudo systemctl start consul-agent

For the Consul server

These files go on the server node.

/etc/systemd/system/consul-server.service

[Unit]
Description=Consul Server
After=network.target

[Service]
Type=simple
EnvironmentFile=-/etc/default/consul-server
ExecStart=/opt/consul/consul agent -server $OPTIONS
PrivateTmp=true

[Install]
WantedBy=multi-user.target

/etc/default/consul-server

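-data-dir is the data directory. -bootstrap-expect is the number of Consul servers. Apparently -bind=<IP address> is also required in environments with multiple network interfaces.

To publish service definitions from the server side as well, add -config-dir=/etc/consul.d/ just as for the agent. The service definition config files are covered below.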

OPTIONS="-bootstrap-expect=1 -data-dir=/var/lib/consul-server"

Starting with systemd

$ sudo systemctl daemon-reload
$ sudo systemctl enable consul-server  # skipped on my machine because this is only a trial
$ sudo systemctl start consul-server

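Consul service configuration

Note: the service configuration is probably unnecessary if Prometheus does the monitoring. The reason is given below.

Earlier I created an empty /etc/consul.d directory; config files for services such as web servers go in there. Having each node hold the configuration for monitoring itself, rather than managing it on the server side, feels fresh. Config files being JSON is fairly fresh too. The following is quoted from part 6 of the article series.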

/etc/consul.d/web.json

{
  "service": {
    "name": "web",
    "tags": [ "nginx" ],
    "port": 80,
    "check": {
      "script": "curl http://127.0.0.1:80/consul.html >/dev/null 2>&1",
      "interval": "10s",
      "timeout": "5s"
    }
  }
}

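Setting up nginx just for this was a hassle, and the node_exporter I installed last time already exposes metrics over HTTP, so I monitor that instead (Prometheus already watches this endpoint, so alerts would fire anyway, but this is a test). I added it as a test, but it turned out to be required for Prometheus to monitor the node. The reason is given below.

The config file looks like this.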

/etc/consul.d/node_exporter.json

{
  "service": {
    "name": "node_exporter",
    "tags": ["node_exporter"],
    "port": 9100
  }
}

I don't understand the details yet, but this is enough to define a service. To monitor it as well, you would write

{
  "service": {
    "name": "node_exporter",
    "tags": ["node_exporter"],
    "port": 9100,
    "check": {
      "script": "curl http://targethost:9100 >/dev/null 2>&1",
      "interval": "10s"
    }
  }
}

or so the Getting Started guide says, but it did not work.

For reference, here is the log from the failure.

$ sudo systemctl status -l consul-agent.service
● consul-agent.service - Consul Agent
   Loaded: loaded (/etc/systemd/system/consul-agent.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since 日 2017-07-30 18:39:58 JST; 1s ago
  Process: 4575 ExecStart=/opt/consul/consul agent $OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 4575 (code=exited, status=1/FAILURE)

 7月 30 18:39:58 targethost consul[4555]: 2017/07/30 18:39:58 [INFO] Exit code:  0
 7月 30 18:39:58 targethost systemd[1]: Started Consul Agent.
 7月 30 18:39:58 targethost systemd[1]: Starting Consul Agent...
 7月 30 18:39:58 targethost consul[4575]: ==> Starting Consul agent...
 7月 30 18:39:58 targethost consul[4575]: ==> Error starting agent: Failed to register service '': Check types that exec scripts are disabled on this agent
 7月 30 18:39:58 targethost systemd[1]: consul-agent.service: main process exited, code=exited, status=1/FAILURE
 7月 30 18:39:58 targethost systemd[1]: Unit consul-agent.service entered failed state.
 7月 30 18:39:58 targethost systemd[1]: consul-agent.service failed.

Well, Prometheus handles the monitoring anyway, so I decided to skip this for now.
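For the record, the log above says that checks which exec scripts are disabled on this agent. As far as I know, Consul 0.9.0 ships with script checks turned off unless enable_script_checks (or the -enable-script-checks flag) is set, so that is worth confirming against the docs for your version. An alternative that avoids exec entirely is Consul's HTTP check. The following is a sketch, assuming node_exporter serves /metrics on localhost; the function name is made up for illustration, and I have not run this against the setup in this post.

```shell
# Hypothetical helper: print a service definition for node_exporter that
# uses an HTTP check instead of a script check, so nothing is exec'd.
node_exporter_service_json() {
  port="$1"   # port node_exporter listens on, e.g. 9100
  cat <<EOF
{
  "service": {
    "name": "node_exporter",
    "tags": ["node_exporter"],
    "port": ${port},
    "check": {
      "http": "http://localhost:${port}/metrics",
      "interval": "10s"
    }
  }
}
EOF
}

# Usage:
#   node_exporter_service_json 9100 | sudo tee /etc/consul.d/node_exporter.json
#   sudo systemctl restart consul-agent
```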

Once the configuration works, the following command on the server side should show that node_exporter is there.

$ sudo /opt/consul/consul catalog services
consul
node_exporter

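Integration with Prometheus

So I wrote the service definition above, but frankly it is not needed for monitoring with Prometheus. The reason is that Prometheus only wants the list of nodes where an exporter is running, so it should work even without a service definition. Optionally, Prometheus can narrow down the services to monitor (see /scrape_configs/consul_sd_config/services).

...or so I thought, but the monitoring targets are apparently determined by services. In other words, unless you filter with services, Prometheus goes and scrapes the consul service that is provided by default (8300 is the port Consul uses for RPC, so there is no point in having Prometheus scrape it...). So I filter by the node_exporter service defined above. Like this.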

scrape_configs:
  - job_name: 'consul_sd_configs'
    consul_sd_configs:
      - server: 'localhost:8500'
        services:
          - 'node_exporter'  # scrape only the service named node_exporter (services matches service names, not tags)

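Showing host names for monitoring targets

By default, the nodes shown at http://<prometheus host>/targets are displayed by IP address. The same goes for alerts.

With IP addresses it is hard to tell at a glance which node is which, so I display host names instead.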

Controlling the instance label | Robust Perception

The Consul configuration I wrote with reference to this article is as follows.

relabel_configs:
  - source_labels: [__meta_consul_node]
    regex:  '(.*)'
    target_label: __address__
    replacement: '${1}:9100'
  - source_labels: [__meta_consul_node]
    target_label: instance

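...or so I thought, but this hard-codes the port to 9100. It cannot handle an exporter on another port, or multiple exporters on the same node as I described in an earlier post. So I ended up with the following configuration.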

relabel_configs:
  - source_labels: [__meta_consul_node, __meta_consul_service_port]
    separator: ':'
    target_label: __address__
  - source_labels: [__meta_consul_node]
    target_label: instance

With this, Prometheus scrapes the port defined on the Consul agent side.
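The fragments above belong to one scrape job, with relabel_configs sitting next to consul_sd_configs under the same job_name. To make the nesting explicit, here is a sketch that prints the assembled job for pasting under scrape_configs in prometheus.yml. The function name consul_scrape_job is made up for illustration; the YAML itself is just the pieces shown above put together.

```shell
# Hypothetical helper: print one scrape job that discovers targets from
# Consul and rewrites __address__ to "<node name>:<service port>".
consul_scrape_job() {
  consul_addr="$1"   # Consul HTTP API address, e.g. localhost:8500
  service="$2"       # Consul service name to scrape, e.g. node_exporter
  cat <<EOF
  - job_name: 'consul_sd_configs'
    consul_sd_configs:
      - server: '${consul_addr}'
        services:
          - '${service}'
    relabel_configs:
      - source_labels: [__meta_consul_node, __meta_consul_service_port]
        separator: ':'
        target_label: __address__
      - source_labels: [__meta_consul_node]
        target_label: instance
EOF
}

# Usage:
#   consul_scrape_job localhost:8500 node_exporter
```

Because __address__ is built from the node name, the Prometheus server has to be able to resolve that name (targethost here).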

How do I see the URL that Prometheus scrapes?

It is shown at http://<Prometheus>/targets.

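The monitored port is wrong...

Check the port in the /etc/consul.d/node_exporter.json file placed on the agent side earlier.

Service information published by Consul

It is available via curl http://<Consul server>:8500/v1/catalog/service/<service> (e.g. <service> = node_exporter). Raw JSON is hard to read, so the jq command is a must.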

$ curl http://<Consul server>:8500/v1/catalog/service/node_exporter | jq .
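To check which node:port pairs the relabel rule would end up with, the catalog response can be boiled down with jq. This is a sketch that assumes jq is installed and that each entry in the response carries the Node and ServicePort fields of the catalog API; the function name catalog_to_targets is made up for illustration.

```shell
# Hypothetical helper: read a /v1/catalog/service/<service> response on
# stdin and print one "<node>:<service port>" line per service instance.
catalog_to_targets() {
  jq -r '.[] | "\(.Node):\(.ServicePort)"'
}

# Usage:
#   curl -s http://<Consul server>:8500/v1/catalog/service/node_exporter | catalog_to_targets
```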

Ports used by Consul

For reference, here are the ports Consul uses, quoted from the linked article.

Function	TCP/UDP	Port	Description
Server RPC	TCP	8300	The Server accepts RPC requests from other Agents
Serf LAN	TCP & UDP	8301	Gossip protocol for the LAN. Used between all Agents
Serf WAN	TCP & UDP	8302	Gossip protocol for the WAN. Used between Servers
CLI RPC	TCP	8400	Used to talk to the local Agent when running the consul command
HTTP API	TCP	8500	The Client accepts HTTP requests
DNS	TCP & UDP	8600	The Agent accepts DNS queries

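Impressions

While setting up Go tools lately, I've noticed that they tend to require config files to be specified explicitly and tend not to read things implicitly (small sample size, though). Maybe it's Go culture.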

2017-07-29

Email notifications with Prometheus Alertmanager (and Postfix)

Prometheus Alertmanager

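Last time I installed the Prometheus server and node_exporter on the same node and confirmed that graphs were being drawn. So this time I try email notifications. That requires installing Alertmanager, a dedicated component for sending alerts, and linking it with the Prometheus server. So the layout is

Monitoring server (host name: promhost)
- Prometheus server, Alertmanager, node_exporter (agent)
Monitored server (host name: targethost)
- node_exporter (agent)

and that is how I install things. I turned the node from last time into the monitoring server and installed the agent on a new, separate node. node_exporter installation was covered last time, so here I note how to install Alertmanager.

Installing Alertmanager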

Like the Prometheus server and node_exporter, it is written in Go, so installation is easy. Following last time, install it to /opt/alertmanager/.

The latest version as of this writing is v0.8.0 (other releases).

$ curl -LO https://github.com/prometheus/alertmanager/releases/download/v0.8.0/alertmanager-0.8.0.linux-amd64.tar.gz

$ tar xzf alertmanager-0.8.0.linux-amd64.tar.gz
$ sudo mv alertmanager-0.8.0.linux-amd64 /opt/alertmanager
$ sudo chmod 755 /opt/alertmanager
$ sudo chown -R root:root /opt/alertmanager

Done.

Configuring Alertmanager

Create /etc/alertmanager/alertmanager.yml (the Alertmanager config file). Note that /global/smtp_require_tls is set to false. When delivering through something like a local Postfix without TLS support enabled, mail could not be sent unless this value was set.

References
- Google Groups
- Consider a global config option for require_tls · Issue #433 · prometheus/alertmanager · GitHub

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_require_tls: false
  smtp_from: 'Alertmanager <alertmanager@localhost.localdomain>'

# The root route on which each incoming alert enters.
route:
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h

  # A default receiver
  receiver: default

receivers:
- name: 'default'
  email_configs:
  - to: 'alertmanager@localhost.localdomain'

Create /etc/systemd/system/alertmanager.service (the systemd unit file for Alertmanager).

[Unit]
Description=Alertmanager for Prometheus
After=network.target

[Service]
Type=simple
EnvironmentFile=-/etc/default/alertmanager
ExecStart=/opt/alertmanager/alertmanager $OPTIONS
PrivateTmp=true
WorkingDirectory=/opt/alertmanager

[Install]
WantedBy=multi-user.target

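Create /etc/default/alertmanager (the environment file read by systemd). If -storage.path is not specified here, Alertmanager tries to create its data directory as data under the current directory. Under systemd, a service without WorkingDirectory runs in /, so a /data directory would be created. The unit file above sets WorkingDirectory as a precaution, because I was worried that something besides -storage.path might also create files in the current directory.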

OPTIONS="-config.file /etc/alertmanager/alertmanager.yml -storage.path /var/lib/alertmanager"

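Defining alerting rules in Prometheus

This is still not enough to get alerts by email. As for where the alert conditions go, they are configured through the Prometheus server config file (/etc/prometheus/prometheus.yml if you followed last time).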

rule_files:
  - /etc/prometheus/alert.rules

Add lines like these. Also add the nodes to monitor.

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets:
        - 'localhost:9090'
        - 'promhost:9100'     # monitoring server (localhost:9100 would also work)
        - 'targethost:9100'   # monitored server

The diff from last time is as follows.

--- /etc/prometheus/prometheus.yml.old  2017-07-29 22:53:32.319944850 +0900
+++ /etc/prometheus/prometheus.yml      2017-07-29 22:44:24.542404336 +0900
@@ -11,8 +11,7 @@

 # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
 rule_files:
-  # - "first.rules"
-  # - "second.rules"
+  - /etc/prometheus/alert.rules

 # A scrape configuration containing exactly one endpoint to scrape:
 # Here it's Prometheus itself.
@@ -24,4 +23,7 @@
     # scheme defaults to 'http'.

     static_configs:
-      - targets: ['localhost:9090']
+      - targets:
+        - 'localhost:9090'
+        - 'promhost:9100'
+        - 'targethost:9100'

Here is the whole file too, just in case.

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'codelab-monitor'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - /etc/prometheus/alert.rules

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets:
        - 'localhost:9090'
        - 'promhost:9100'
        - 'targethost:9100'

After updating prometheus.yml, create /etc/prometheus/alert.rules.

# Alert for any instance that is unreachable for >5 minutes.
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }

# Alert for any instance that have a median request latency >1s.
ALERT APIHighRequestLatency
  IF api_http_request_latencies_second{quantile="0.5"} > 1
  FOR 1m
  ANNOTATIONS {
    summary = "High request latency on {{ $labels.instance }}",
    description = "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)",
  }

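The above is straight from the official documentation. With it, email is sent when a monitored node_exporter goes down or a node stops (but this is still not enough).

Linking the Prometheus server and Alertmanager

To make the Prometheus server aware of Alertmanager, add -alertmanager.url to the command-line arguments. Last time I created /etc/default/prometheus, so it goes there.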

OPTIONS="-config.file=/etc/prometheus/prometheus.yml -storage.local.path=/var/lib/prometheus -web.console.libraries=/etc/prometheus/console_libraries -web.console.templates=/etc/prometheus/consoles -alertmanager.url=http://localhost:9093"

Append it to the end as above and restart.

$ sudo systemctl restart prometheus

Email notifications should now be sent.
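Waiting for InstanceDown to fire takes at least the 5 minutes set in FOR, so to test only the Alertmanager-to-mail path it may be quicker to push a test alert into Alertmanager by hand. This is a sketch that assumes the v1 HTTP API (/api/v1/alerts) of Alertmanager versions from around this time, so check the API of your version first. The function name and the alert name are made up for illustration.

```shell
# Hypothetical helper: print a minimal JSON payload describing one alert,
# in the list-of-alerts shape that Alertmanager's alerts endpoint accepts.
test_alert_payload() {
  alertname="$1"   # any name, used only to recognise the test mail
  printf '[{"labels":{"alertname":"%s","severity":"page"}}]\n' "$alertname"
}

# Usage (assumed endpoint, see the note above):
#   test_alert_payload MailPathTest | curl -s -XPOST -d @- http://localhost:9093/api/v1/alerts
```

With the route settings above, the mail should go out after group_wait (30s) if the SMTP side is working.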

Dealing with error messages

I ran into a few error messages, so here is how to deal with them.

Error on notify: Cancelling notify retry due to unrecoverable error: parsing from addresses: mail: missing phrase

This error message appeared in systemctl status -l alertmanager.

It appears when an email address given in /global/smtp_from, /receivers/email_configs/to, and so on is invalid. Write it like smtp_from: 'Alertmanager <alertmanager@localhost.localdomain>' or smtp_from: alertmanager@localhost.localdomain (even for localhost, don't forget the part after the @).

References
- mail: missing phrase · Issue #624 · prometheus/alertmanager · GitHub

Humanity

Edit the world by your favorite way