Monitor Your Cardano Stake Pool Nodes Using The OFO Stack (Debian Buster)

We have been monitoring our nodes with the ELK Stack, Prometheus, and Beats since we started our stake pool. Performance and reliability are important to us, especially when it comes to our delegators' returns. Beyond that, the Cardano network benefits from having professionally run stake pools with healthy nodes.

Although we were happy with that setup, there have been a number of concerns about the future viability of ELK, as its new licensing makes it less than an open source platform. Wanting to support the excellent projects https://opensearch.org[OpenSearch] and https://fluentbit.io[Fluent Bit] with another use case, and to simplify our monitoring solution, we decided to share our notes on how to monitor your Cardano Stake Pool with an https://ofostack.org[OFO Stack].

The notes below explain how to monitor your Cardano Relay Node or Cardano Producer Node with an https://ofostack.org[OFO Stack] (OpenSearch, Fluent Bit, OpenSearch Dashboards) on Debian 10 (Buster), using https://caddyserver.com/[Caddy] as a reverse proxy with SSL.

WARNING: This guide is not intended to teach you how to harden your monitoring server or node infrastructure. You should take the proper precautions before placing it into production. One important, but by no means exhaustive, hardening measure is to place a firewall in front of all systems and restrict port access to trusted systems only.

Cardano Stake Pool Monitoring Server

The following steps should be performed on the freshly installed machine to be used as the Cardano Stake Pool monitoring server.

Update the Operating System

Run the following commands as root:

apt update -y
apt upgrade -y
apt dist-upgrade -y
apt autoremove -y
shutdown -r now

Configure Swap Space (Optional)

Run the following commands as root:

swapon --show
fallocate -l 10G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo "/swapfile swap swap defaults 0 0" >> /etc/fstab
swapon --show
shutdown -r now
swapon --show
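
Appending to ‘/etc/fstab’ with ‘>>’ adds a duplicate line every time the step is repeated. The following is a minimal sketch of an idempotent alternative; the helper name ‘add_swap_entry’ is hypothetical, and the entry text is the one used above.

```shell
# Hypothetical helper: append the swap entry to an fstab-style file
# only when an identical line is not already present.
add_swap_entry() {
    fstab_file="$1"
    entry="/swapfile swap swap defaults 0 0"
    # -x matches whole lines only, -F treats the entry as a fixed string.
    if ! grep -qxF "$entry" "$fstab_file" 2>/dev/null; then
        echo "$entry" >> "$fstab_file"
    fi
}

# Usage (as root):
#   add_swap_entry /etc/fstab
```

Running the helper twice leaves a single entry in the file, so the swap step can be repeated safely.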

Add OpenSearch User

Run the following commands as root:

adduser \
    --system \
    --shell /bin/bash \
    --gecos 'OpenSearch User' \
    --group \
    --disabled-password \
    --home /opt/opensearch \
    opensearch

Install Java 11

Run the following commands as root:

apt update
apt install -y openjdk-11-jdk

Switch to OpenSearch User

Run the following commands as root:

su - opensearch
cd ~

Install OpenSearch

Run the following commands as opensearch user:

wget https://artifacts.opensearch.org/releases/bundle/opensearch/1.0.0/opensearch-1.0.0-linux-x64.tar.gz
tar -xvzf opensearch-1.0.0-linux-x64.tar.gz
rm opensearch-1.0.0-linux-x64.tar.gz
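
Before unpacking a downloaded archive, it can be worth comparing its checksum against a published value. The sketch below assumes you have obtained the expected SHA-512 digest from a source you trust; the helper name ‘verify_sha512’ and the ‘EXPECTED_DIGEST’ placeholder are hypothetical.

```shell
# Hypothetical helper: succeed only if the SHA-512 digest of a file
# equals the expected hex digest passed as the second argument.
verify_sha512() {
    file="$1"
    expected="$2"
    # sha512sum prints "<digest>  <filename>"; keep the digest only.
    actual="$(sha512sum "$file" | awk '{print $1}')"
    [ "$actual" = "$expected" ]
}

# Usage, with EXPECTED_DIGEST standing in for the published value:
#   verify_sha512 opensearch-1.0.0-linux-x64.tar.gz "EXPECTED_DIGEST" \
#       && tar -xvzf opensearch-1.0.0-linux-x64.tar.gz
```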

Create a new file ‘/lib/systemd/system/opensearch.service’ (as root) with the following contents:

[Unit]
Description=OpenSearch
Documentation=https://opensearch.org/docs/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
RuntimeDirectory=opensearch
#PrivateTmp=true

WorkingDirectory=/opt/opensearch/opensearch-1.0.0

User=opensearch
Group=opensearch

ExecStart=/opt/opensearch/opensearch-1.0.0/opensearch-tar-install.sh

# StandardOutput is redirected to the journal since some error
# messages may be written to standard output before the
# OpenSearch logging system is initialized. View them with
# "journalctl -u opensearch.service".
StandardOutput=journal
StandardError=inherit

# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65535

# Specifies the maximum number of processes
LimitNPROC=4096

# Specifies the maximum size of virtual memory
LimitAS=infinity

# Specifies the maximum file size
LimitFSIZE=infinity

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0

# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM

# Send the signal only to the JVM rather than its control group
KillMode=process

# Java process is never killed
SendSIGKILL=no

# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143

# Allow a slow startup before the systemd notifier module kicks in to extend the timeout
TimeoutStartSec=75

[Install]
WantedBy=multi-user.target

Edit /etc/sysctl.conf (as root) and add (if not in file) or modify the following line:

vm.max_map_count=262144
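
The "add if not in file, or modify" step can also be scripted. The following is a minimal sketch, assuming a plain ‘key=value’ file layout; the helper name ‘set_sysctl_value’ is hypothetical.

```shell
# Hypothetical helper: make sure a sysctl-style file contains exactly
# one active "key=value" line for the given key.
set_sysctl_value() {
    conf_file="$1"
    key="$2"
    value="$3"
    tmp_file="$(mktemp)"
    touch "$conf_file"
    # Keep every line except existing (or commented-out) assignments of
    # the key. Dots in the key act as regex wildcards here, which is
    # loose but adequate for a sketch.
    grep -vE "^[#[:space:]]*${key}[[:space:]]*=" "$conf_file" > "$tmp_file" || true
    echo "${key}=${value}" >> "$tmp_file"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}

# Usage (as root):
#   set_sysctl_value /etc/sysctl.conf vm.max_map_count 262144
```

Note that the helper moves the setting to the end of the file and drops any commented-out copy of the same key.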

Run the following commands as root:

sysctl -p
systemctl daemon-reload
systemctl start opensearch.service
systemctl enable opensearch.service

Verify Successful Startup

Run the following commands as root:

curl -XGET 'https://localhost:9200' -u 'admin:admin' --insecure
curl -XGET 'https://localhost:9200/_cat/nodes?v' -u 'admin:admin' --insecure
curl -XGET 'https://localhost:9200/_cat/plugins?v' -u 'admin:admin' --insecure
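
OpenSearch can take a little while to accept connections after the service starts, so the first request may fail. A small retry helper, sketched below, polls until a command succeeds. The function name ‘retry’ and the attempt and delay values in the usage comment are assumptions, not part of the original setup.

```shell
# Hypothetical helper: run a command up to <attempts> times, waiting
# <delay> seconds between attempts, and stop at the first success.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep only if another attempt is still to come.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}

# Usage: try for up to 30 attempts, 5 seconds apart.
#   retry 30 5 curl -s -o /dev/null 'https://localhost:9200' -u 'admin:admin' --insecure
```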

Switch to OpenSearch User

Run the following commands as root:

su - opensearch
cd ~

Install OpenSearch Dashboards

Run the following commands as opensearch user:

wget https://artifacts.opensearch.org/releases/bundle/opensearch-dashboards/1.0.0/opensearch-dashboards-1.0.0-linux-x64.tar.gz
tar -zxf opensearch-dashboards-1.0.0-linux-x64.tar.gz
rm opensearch-dashboards-1.0.0-linux-x64.tar.gz

Create the file ‘/lib/systemd/system/opensearch-dashboards.service’ (as root) with the following contents:

[Unit]
Description=OpenSearch Dashboards
 
[Service]
Type=simple
User=opensearch
Group=opensearch
ExecStart=/opt/opensearch/opensearch-dashboards-1.0.0/bin/opensearch-dashboards
Restart=always
WorkingDirectory=/opt/opensearch/opensearch-dashboards-1.0.0

[Install]
WantedBy=multi-user.target

Run the following commands as root:

systemctl daemon-reload
systemctl start opensearch-dashboards.service
systemctl enable opensearch-dashboards.service

Install Caddy

Run the following commands as root:

wget https://github.com/caddyserver/caddy/releases/download/v2.3.0/caddy_2.3.0_linux_amd64.deb
dpkg -i caddy_2.3.0_linux_amd64.deb
rm caddy_2.3.0_linux_amd64.deb

Configure Caddy

Edit ‘/etc/caddy/Caddyfile’ and replace all contents with the following:

your_monitoring_server.hostname.com {
    reverse_proxy localhost:5601
}

Allow Ports Through UFW (Firewall)

WARNING: Please set the ports to values you are comfortable with based on your own personal risk tolerance. We suggest only allowing trusted systems to connect and to place an appropriately configured firewall in front of all production systems.

Run the following commands as root:

ufw allow proto tcp from any to any port 443
ufw allow proto tcp from any to any port 80
ufw allow proto tcp from any to any port 9200
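
In line with the warning above, port 9200 can be opened to specific node addresses rather than to ‘any’. The sketch below only prints the rules so they can be reviewed before being run; the helper name ‘print_ufw_rules’ is hypothetical, and the addresses in the usage comment are documentation-range placeholders, not real hosts.

```shell
# Hypothetical helper: print one "ufw allow" rule per trusted source
# address for the given TCP port. Nothing is executed.
print_ufw_rules() {
    port="$1"
    shift
    for ip in "$@"; do
        echo "ufw allow proto tcp from ${ip} to any port ${port}"
    done
}

# Usage: review the output, then run the printed commands as root.
#   print_ufw_rules 9200 203.0.113.10 203.0.113.11
```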

Start Caddy Service

Run the following commands as root:

systemctl start caddy.service
systemctl enable caddy.service

Switch to OpenSearch User

Run the following commands as root:

su - opensearch
cd ~

Change Default Passwords

Generate Password Hashes

Run the following commands as opensearch user:

chmod +x ~/opensearch-1.0.0/plugins/opensearch-security/tools/hash.sh
~/opensearch-1.0.0/plugins/opensearch-security/tools/hash.sh

Edit Internal User Passwords

Edit the file below and replace the existing password hash of each account you keep with a hash generated in the previous step. Delete any accounts that will not be needed or used.

~/opensearch-1.0.0/plugins/opensearch-security/securityconfig/internal_users.yml
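
To illustrate the shape of an entry in that file, the sketch below prints a minimal user block. The helper name ‘print_internal_user’, the user name, the description, and the hash in the usage comment are placeholders; role assignments are deliberately left out, so treat this as a starting point rather than a complete account definition.

```shell
# Hypothetical helper: print a minimal internal_users.yml entry for
# the given user name, password hash, and description.
print_internal_user() {
    user_name="$1"
    password_hash="$2"
    description="$3"
    printf '%s:\n' "$user_name"
    printf '  hash: "%s"\n' "$password_hash"
    printf '  reserved: false\n'
    printf '  description: "%s"\n' "$description"
}

# Usage: single quotes keep the "$" characters of the hash intact.
#   print_internal_user fluentbit '$2y$12$PLACEHOLDER_HASH' 'Fluent Bit user'
```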

Apply Security Settings

Run the following commands as opensearch user:

chmod +x ~/opensearch-1.0.0/plugins/opensearch-security/tools/securityadmin.sh
cd ~/opensearch-1.0.0/plugins/opensearch-security/tools/

./securityadmin.sh -cd ../securityconfig/ -icl -nhnv \
   -cacert ../../../config/root-ca.pem \
   -cert ../../../config/kirk.pem \
   -key ../../../config/kirk-key.pem

Cardano Stake Pool Node Monitoring Configuration

The following steps should be performed on your Cardano Stake Pool nodes.

Install Fluent Bit on Nodes

Run the following commands as root:

curl https://packages.fluentbit.io/fluentbit.key | apt-key add -
echo 'deb https://packages.fluentbit.io/debian/buster buster main' > /etc/apt/sources.list.d/fluentbit.list
apt update -y
apt install -y td-agent-bit
systemctl enable td-agent-bit
systemctl start td-agent-bit

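Configure Fluent Bit User

Log in to the web interface with the administrative user at https://your_monitoring_server.hostname.com and set up the user that Fluent Bit will authenticate as. The example Fluent Bit configuration later in this guide uses the user name ‘fluentbit’.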

Create Index Pattern

Create an index pattern ‘system_metrics-*’.

Configure System Metrics Reporting

Edit ‘/etc/td-agent-bit/td-agent-bit.conf’ (as root) and replace with the following contents:

[SERVICE]
    flush        5
    daemon       Off
    log_level    info

    parsers_file parsers.conf
    plugins_file plugins.conf

    http_server  Off
    http_listen  0.0.0.0
    http_port    2020

    storage.metrics on

# Duplicate and edit this section for each network interface you want monitored
[INPUT]
    Name          netif
    Tag           node-name-example-cardano-producer
    Interval_Sec  1
    Interval_NSec 0
    Interface     enp1s0

[INPUT]
    name cpu
    tag  node-name-example-cardano-producer
    # Read interval (sec) Default: 1
    interval_sec 1

[INPUT]
    Name          disk
    Tag           node-name-example-cardano-producer
    Interval_Sec  1
    Interval_NSec 0

[INPUT]
    Name   mem
    Tag    node-name-example-cardano-producer
  
[INPUT]
    Name        tail
    Tag         node-name-example-cardano-producer
    Parser      cardano
    Path        /var/log/cardano/nodelog.log

[OUTPUT]
    Name  es
    Match *
    Host  your_monitoring_server.hostname.com
    Port  9200
    Logstash_Format True
    Logstash_Prefix system_metrics
    tls on
    tls.verify off
    Include_Tag_Key True
    Tag_Key Tag
    HTTP_User fluentbit
    HTTP_Passwd yourfluentbituserpasswordhere

[OUTPUT]
    name  stdout
    match *

Run the following commands as root:

systemctl restart td-agent-bit
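
The configuration above asks you to duplicate the ‘netif’ section for each network interface. As a minimal sketch, the sections can be generated instead of copied by hand; the helper name ‘print_netif_inputs’ and the second interface name in the usage comment are hypothetical.

```shell
# Hypothetical helper: print one netif [INPUT] section per interface,
# all sharing the same tag.
print_netif_inputs() {
    tag="$1"
    shift
    for iface in "$@"; do
        printf '[INPUT]\n'
        printf '    Name          netif\n'
        printf '    Tag           %s\n' "$tag"
        printf '    Interval_Sec  1\n'
        printf '    Interval_NSec 0\n'
        printf '    Interface     %s\n\n' "$iface"
    done
}

# Usage: paste the output into /etc/td-agent-bit/td-agent-bit.conf.
#   print_netif_inputs node-name-example-cardano-producer enp1s0 enp2s0
```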

Modify Node Configuration

Edit ‘/etc/cardano/mainnet-config.json’ (as root) and set the values as appropriate. An example working configuration is below:

{
  "ApplicationName": "cardano-sl",
  "ApplicationVersion": 1,
  "ByronGenesisFile": "mainnet-byron-genesis.json",
  "ByronGenesisHash": "5f20df933584822601f9e3f8c024eb5eb252fe8cefb24d1317dc3d432e940ebb",
  "LastKnownBlockVersion-Alt": 0,
  "LastKnownBlockVersion-Major": 3,
  "LastKnownBlockVersion-Minor": 0,
  "MaxKnownMajorProtocolVersion": 2,
  "Protocol": "Cardano",
  "RequiresNetworkMagic": "RequiresNoMagic",
  "ShelleyGenesisFile": "mainnet-shelley-genesis.json",
  "ShelleyGenesisHash": "1a3be38bcbb7911969283716ad7aa550250226b76a61fc51cc9a9a35d9276d81",
  "TraceBlockFetchClient": false,
  "TraceBlockFetchDecisions": false,
  "TraceBlockFetchProtocol": false,
  "TraceBlockFetchProtocolSerialised": false,
  "TraceBlockFetchServer": false,
  "TraceChainDb": false,
  "TraceChainSyncBlockServer": false,
  "TraceChainSyncClient": false,
  "TraceChainSyncHeaderServer": false,
  "TraceChainSyncProtocol": false,
  "TraceDNSResolver": false,
  "TraceDNSSubscription": false,
  "TraceErrorPolicy": false,
  "TraceForge": true,
  "TraceHandshake": false,
  "TraceIpSubscription": false,
  "TraceLocalChainSyncProtocol": false,
  "TraceLocalErrorPolicy": false,
  "TraceLocalHandshake": false,
  "TraceLocalTxSubmissionProtocol": false,
  "TraceLocalTxSubmissionServer": false,
  "TraceMempool": false,
  "TraceMux": false,
  "TraceTxInbound": false,
  "TraceTxOutbound": false,
  "TraceTxSubmissionProtocol": false,
  "TracingVerbosity": "MaximalVerbosity",
  "TurnOnLogMetrics": true,
  "TurnOnLogging": true,
  "defaultBackends": [
    "KatipBK"
  ],
  "defaultScribes": [
    [
      "FileSK",
      "/var/log/cardano/nodelog.log"
    ]
  ],
  "hasEKG": 12788,
  "hasPrometheus": [
    "0.0.0.0",
    12798
  ],
  "minSeverity": "Warning",
  "options": {
    "mapSubtrace": {
      "cardano.node.metrics": {
        "subtrace": "Neutral"
      }
    }
  },
  "rotation": {
    "rpKeepFilesNum": 10,
    "rpLogLimitBytes": 5000000,
    "rpMaxAgeHours": 24
  },
  "setupBackends": [
    "KatipBK"
  ],
  "setupScribes": [
    {
      "scFormat": "ScJson",
      "scKind": "FileSK",
      "scName": "/var/log/cardano/nodelog.log",
      "scRotation": null
    }
  ]
}

Create the log directory and change its ownership to the user that runs the node (‘cardano’ is the example user). Run the following commands as root:

mkdir -p /var/log/cardano
chown cardano:cardano /var/log/cardano
systemctl restart cardano-node.service

Edit ‘/etc/td-agent-bit/parsers.conf’ (as root) and add the following section:

[PARSER]
    Name         cardano
    Format       json
    Time_Key     at
    Time_Format  %Y-%m-%dT%H:%M:%S.%L
    Time_Keep    On

Run the following commands as root:

systemctl restart td-agent-bit

You’re Done

If everything went well, you are now shipping your metrics to indices named in the format ‘system_metrics-YYYY.MM.DD’ and can access them via the index pattern ‘system_metrics-*’. Have fun creating your dashboards and visualizations!
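
To confirm that data is arriving, you can query the index for the current day. The sketch below builds the index name from the prefix; it assumes the date part follows UTC, and the helper name ‘index_name_for_today’ and the credentials in the usage comment are placeholders.

```shell
# Hypothetical helper: print the index name for the current UTC date
# in the form <prefix>-YYYY.MM.DD.
index_name_for_today() {
    prefix="$1"
    printf '%s-%s\n' "$prefix" "$(date -u +%Y.%m.%d)"
}

# Usage (on the monitoring server):
#   curl -s -u 'admin:yourpassword' --insecure \
#       "https://localhost:9200/_cat/indices/$(index_name_for_today system_metrics)?v"
```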

Bonus Notes