Interpretation of MatrixDB monitoring parameters

1. Overview section

The Overview section displays the overall operating status of the cluster, including:

| Parameter name | Description | Reference alarm threshold |
| --- | --- | --- |
| Cluster status | Cluster node status, including:<br>0: Normal<br>1: No Standby<br>2: No Mirror<br>10: Distribution unbalanced (some nodes went down and were restored, but the master-slave roles have not been rebalanced)<br>11: Master-slave asynchronous nodes exist (some mirror nodes are not synchronized with their primary)<br>12: Only Master (only the Master node of the cluster is started, usually for diagnosis)<br>20: Segment downtime (a Segment node is unavailable, and the cluster is not available) | Segment downtime is a serious event, and an alarm is required |
| Runtime | MatrixDB's run time since startup and the run time of the master host operating system | |
| Version | Version of MatrixDB | |
| Connection status | Number of connections in the database system, including: total connections, connections with blocked queries, idle connections, and connections idle in a transaction | |
| Slow query | Number of queries in the current system that have been running for more than 1 day | A value greater than 0 means that there are particularly slow queries, and an alarm is required |
| Transactions | Statistics on transaction commit and rollback counts | A rollback alarm threshold can be set |
| Disk usage | Disk usage and remaining space for the master node and segment nodes | Setting alarms directly in node_exporter is recommended |
| Node status | State of each node, including:<br>0: UP (normal)<br>10: Switched (role swap; a master-slave switch has occurred and the roles need to be rebalanced)<br>11: Resync (master-slave synchronization in progress)<br>20: Down (downtime) | Alarms should be added for 11 and 20 |
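
The status codes and alarm rules above can be sketched as a small lookup. This is a minimal illustration, not part of MatrixDB: the names `CLUSTER_STATUS`, `NODE_STATUS`, `describe`, and `slow_query_alarm` are hypothetical, and treating an unknown code as alarm-worthy is an assumption rather than documented behavior.

```python
# Code-to-label tables copied from the Overview section.
CLUSTER_STATUS = {
    0: "Normal",
    1: "No Standby",
    2: "No Mirror",
    10: "Distribution unbalanced",
    11: "Master-slave asynchronous nodes",
    12: "Only Master",
    20: "Segment downtime",
}
NODE_STATUS = {0: "UP", 10: "Switched", 11: "Resync", 20: "Down"}

# Codes for which the Overview section asks for an alarm.
CLUSTER_ALARM_CODES = {20}
NODE_ALARM_CODES = {11, 20}


def describe(code, labels, alarm_codes):
    """Return (label, needs_alarm) for a status code.

    Assumption: a code missing from the table is reported as unknown
    and raises an alarm, so unexpected values are not silently ignored.
    """
    if code not in labels:
        return (f"Unknown ({code})", True)
    return (labels[code], code in alarm_codes)


def slow_query_alarm(slow_query_count):
    """Alarm when any query has been running for more than 1 day."""
    return slow_query_count > 0


if __name__ == "__main__":
    print(describe(20, CLUSTER_STATUS, CLUSTER_ALARM_CODES))
    print(describe(10, NODE_STATUS, NODE_ALARM_CODES))
    print(slow_query_alarm(0))
```

In practice these rules would more likely live in the alerting system that reads the exported metrics; the sketch only makes the code-to-alarm mapping explicit.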

2. Database Performance section

The Database Performance section shows database performance metrics, including:

| Parameter name | Description | Reference alarm threshold |
| --- | --- | --- |
| Page Hit Ratio | Buffer hit ratio when reading data pages | |
| Temp Size | Temporary file usage | |
| Deadlocks | Number of deadlocks | Alarm when greater than 0 |
| Checksum Failures | Number of data page checksum verification failures | Alarm when greater than 0 |
| Sessions Per Database | Number of connections per database | |
| Page Cache Hit | blks_hit: number of cache hits when reading data pages<br>blks_read: number of cache misses that required a read from disk | |
| Rows Read | Number of tuples read and returned by queries | |
| Checkpoints | Number of times checkpoints were triggered, including:<br>checkpoints_req: manual trigger<br>checkpoints_timed: periodic trigger | |
| Replication Latency | Master-slave replication delay, in ms<br>write_lag: delay until the log is written to the mirror's file cache<br>flush_lag: delay until the log is flushed to the mirror's disk<br>replay_lag: delay until log replay completes<br>Top Segment: the maximum, across all nodes, of the sum write_lag + flush_lag + replay_lag | The alarm threshold can be set according to the situation |
| Rows Insert/Update/Delete | Rows Insert: number of rows inserted<br>Rows Update: number of rows updated<br>Rows Delete: number of rows deleted | |
| Checkpoint buffers | Dirty page write statistics<br>buffers_checkpoint: number of dirty pages written by checkpoints<br>buffers_clean: number of dirty pages written by bgwriter<br>buffers_backend: number of dirty pages written by backend processes | |
| Top 10 Replication Lag Size | Replication lag size of the top 10 nodes, calculated as the difference between the sent LSN and the replay LSN | The alarm threshold can be set according to the situation |
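
The arithmetic behind several of these panels can be sketched as follows. This is an illustration under stated assumptions, not the dashboard's actual queries: the hit ratio is assumed to be `blks_hit / (blks_hit + blks_read)`, LSNs are assumed to use the PostgreSQL text form `high/low` in hexadecimal, and all function names are hypothetical.

```python
def page_hit_ratio(blks_hit, blks_read):
    """Fraction of page reads served from cache (assumed formula)."""
    total = blks_hit + blks_read
    return blks_hit / total if total else None


def parse_lsn(lsn):
    """Convert a PostgreSQL-style LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(sent_lsn, replay_lsn):
    """Lag size: difference between the sent LSN and the replay LSN."""
    return parse_lsn(sent_lsn) - parse_lsn(replay_lsn)


def top_segment_lag_ms(lags):
    """Largest write_lag + flush_lag + replay_lag over all nodes.

    `lags` maps a node name to a (write_lag, flush_lag, replay_lag)
    tuple in milliseconds; an empty mapping yields 0.
    """
    return max((sum(parts) for parts in lags.values()), default=0)


if __name__ == "__main__":
    print(page_hit_ratio(90, 10))
    print(replication_lag_bytes("0/20", "0/10"))
    print(top_segment_lag_ms({"seg0": (1, 2, 3), "seg1": (5, 5, 5)}))
```

Returning `None` for the hit ratio when no pages have been read avoids a division by zero on an idle database; a dashboard might instead show the value as missing.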

3. Storage section

The Storage section displays storage-related statistics, including:

| Parameter name | Description | Reference alarm threshold |
| --- | --- | --- |
| Top 10 Databases | Top 10 databases | |
| Top 10 Users | Size of the data generated by the top 10 users | |
| Top 10 Aging Database | Age of the top 10 databases (transaction IDs less than this value have been replaced with Frozen) | |
| Top 10 Big Tables | Size of the top 10 tables | |
| Top 10 Big Partitions | Size of the top 10 partition tables | |
| Top 10 Growth Today | Top 10 table size increments for the current day | |
| Top 10 Growth Last 7 Days | Top 10 table size increments over the past 7 days | |
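
The two growth panels rank tables by size increment over a period. A minimal sketch of that ranking, assuming two size snapshots taken at the start and end of the period, is shown below. The function name `top_growth` and the snapshot format (table name mapped to size in bytes) are hypothetical, and counting a table that is absent from the earlier snapshot as having grown from 0 is an assumption.

```python
def top_growth(previous, current, n=10):
    """Rank tables by size increase between two snapshots.

    `previous` and `current` map table names to sizes in bytes.
    Tables missing from `previous` are assumed to have started at 0.
    Returns up to `n` (table, increment) pairs, largest increment first.
    """
    growth = {
        table: size - previous.get(table, 0)
        for table, size in current.items()
    }
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    start_of_day = {"orders": 100, "events": 50}
    now = {"orders": 150, "events": 300, "audit": 10}
    for table, increment in top_growth(start_of_day, now, n=2):
        print(table, increment)
```

The same function covers the 7-day panel if the earlier snapshot is taken a week before the later one.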