Performance Tuning

This document outlines how to identify performance bottlenecks and describes typical performance tuning scenarios both inside and outside the YMatrix database.

1 How to Identify Performance Bottlenecks?

When a business SQL query becomes slow or you want to improve the performance of a SQL query, first confirm the general direction of the issue:

Is the entire server slowing down?

  • Determine this by analyzing server resource usage.

Or is only a specific SQL query slow?

  • Use the Query Monitor feature in the YMatrix graphical interface.
  • Or observe slow queries in real time via the Overview page in Grafana/Prometheus monitoring.

Once a slow SQL query is identified, follow the steps below for investigation and analysis:

  • First, determine whether the performance bottleneck exists outside the database (server resources, business changes, etc.).
  • If not due to external reasons, then analyze the internal database (statistics update, data skew, query plan analysis, etc.).

1.1 Identifying External Database Issues

1.1.1 Server Resource Analysis

Note!
There are various methods to collect information. Here we detail only one or two. If you have other preferred commands, you can use them for collection and analysis.

Server Resources

• CPU Usage
  Collection: via YMatrix GUI - Cluster Management - Metrics View (CPU usage per node), or use the top command for details: user CPU (us), system CPU (sy), idle CPU (id).
  Analysis: when CPU usage is high, determine whether it is user CPU or system CPU. High user CPU means a program is consuming CPU; check the efficiency of its code. High system CPU suggests other server resources (disk I/O, memory, network, etc.) are insufficient.

• Memory & Virtual Memory
  Collection: via YMatrix GUI - Cluster Management - Metrics View (memory usage per node), or use vmstat for details: si (memory swapped in per second), so (memory swapped out per second).
  Analysis: if si and so are consistently non-zero, memory is insufficient and the system is swapping heavily, which degrades performance.

• Disk I/O (Read/Write Rate)
  Collection: via YMatrix GUI - Cluster Management - Metrics View (disk I/O per node), or use iostat -x for details: %util (percentage of time the device spent servicing I/O requests) and, in the avg-cpu summary, %iowait (percentage of CPU time spent waiting for I/O).
  Analysis: disk I/O speed, not capacity, limits performance. If %util is near 100%, the device is saturated with I/O requests. If %iowait is high, disk I/O is the bottleneck; consider upgrading or replacing the disk array.

• Network
  Collection: via YMatrix GUI - Cluster Management - Metrics View (receive/send rate per node), or use sar -n DEV 1 2 for details: rxkB/s (kilobytes received per second), txkB/s (kilobytes transmitted per second).
  Analysis: compare rxkB/s and txkB/s with the total network bandwidth. If either is close to the total bandwidth, a network bottleneck exists.

• Kernel Parameters
  Collection: on Linux systems, check the corresponding kernel parameter files under /proc/sys, e.g., cat /proc/sys/vm/overcommit_memory.
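
The swapping check above can be scripted. The sketch below parses captured vmstat output (the sample text here is illustrative; in practice feed it real output, e.g. `sample=$(vmstat 1 5)`) and flags sustained swapping:

```shell
# Flag sustained swapping from vmstat output.
# The sample below is a captured, illustrative vmstat line set.
sample=' r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 102400  81920  20480 409600  150  200    10    20  300  400  5  3 90  2  0'

# si is the 7th column, so is the 8th; take the most recent sample line.
last=$(printf '%s\n' "$sample" | tail -n 1)
si=$(printf '%s\n' "$last" | awk '{print $7}')
so=$(printf '%s\n' "$last" | awk '{print $8}')

if [ "$si" -gt 0 ] && [ "$so" -gt 0 ]; then
  echo "swapping detected: si=$si so=$so"
else
  echo "no sustained swapping"
fi
```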

1.1.2 Business Change Confirmation

  • Confirm whether it is a newly launched business.
  • Check the DDL of cluster tables to confirm whether the issue is caused by online DDL. For example, adding an index to a large table consumes significant cluster resources and may interfere with normal access requests.
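
Long-running DDL can also be spotted from the standard pg_stat_activity catalog view (available in Greenplum-based systems such as YMatrix). A sketch that looks for active index builds:

```sql
-- Look for active, long-running DDL such as index builds.
SELECT pid, usename, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE '%CREATE INDEX%';
```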

1.2 How to Investigate Internal Database Issues

Note!
Not all options in the table below are mandatory.

Software Environment

• OS Version
  Collection: uname -a
  Analysis: whether the bottleneck is related to the OS version.

• YMatrix Version
  Collection: SELECT version();
  Analysis: whether the bottleneck is related to the YMatrix version.

Cluster Info

• Cluster Deployment Topology
  Collection: via YMatrix GUI - Cluster Management, or SELECT * FROM gp_segment_configuration;
  Analysis: if a failover has occurred, a single physical node may host more Segments than planned, potentially degrading performance.

Database Info

• Table Structure
  Collection: via YMatrix GUI - Data Tables, or \d+ in psql.
  Analysis: confirm whether severe data skew results from an unreasonable distribution key.

• Related Logs
  Collection: utility logs are in $HOME/gpAdminLogs by default; database logs are in each instance's data directory.
  Analysis: analyze the logs if needed.

• Slow Queries
  Collection: via YMatrix GUI - Query Monitor; check for blocking sessions.
  Analysis: if slow queries exist, locate and analyze them.

• Query Plan
  Collection: use EXPLAIN SELECT ... to view the query plan.
  Analysis: if the plan cost is too high, analyze the path and the root cause.

• Save Environment
  Collection: use YMatrix's minirepro tool.
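
The failover check can be run as a query, assuming the Greenplum-compatible gp_segment_configuration catalog:

```sql
-- Count primary segments per host; after a failover, one host may run
-- more primaries than its peers (content >= 0 excludes the Master).
SELECT hostname, count(*) AS primaries
FROM gp_segment_configuration
WHERE role = 'p' AND content >= 0
GROUP BY hostname
ORDER BY primaries DESC;
```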

2 Typical Internal Database Tuning Scenarios

Internal database tuning refers to tuning individual SQL statements.

2.1 Stale Statistics

Symptoms:
A table has undergone significant data changes (e.g., bulk inserts or deletes) since statistics were last collected. Without a fresh ANALYZE, the optimizer may build a poor query plan from outdated statistics, degrading performance.

Analysis Method:
Use EXPLAIN ANALYZE to check whether the statistics are accurate. If the estimated row counts in the plan differ greatly from the actual counts, the statistics are stale. Re-collect them with:

=# ANALYZE <tablename>;
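
For example (t1 is a hypothetical table), compare the planner's estimate with the actual row count in the EXPLAIN ANALYZE output:

```sql
-- Each plan node prints "rows=<estimate>" and "actual ... rows=<real>".
-- A large gap between the two indicates stale statistics.
EXPLAIN ANALYZE
SELECT * FROM t1 WHERE create_time > now() - interval '1 day';

-- Refresh the statistics, then re-check the plan.
ANALYZE t1;
```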

2.2 Data Skew

Symptoms:
Unreasonable distribution keys chosen during initial data model design concentrate data on a few Segment instances, causing data skew. In YMatrix's distributed architecture this becomes a bottleneck: total query time is determined by the slowest Segment plus the Master's processing time.

Analysis Method:
Use the following command to check data distribution. If Segment data counts differ significantly, skew exists:

=# SELECT gp_segment_id, COUNT(*) FROM <tablename> GROUP BY gp_segment_id;
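
The per-segment counts can also be condensed into a single skew ratio (t1 is a hypothetical table):

```sql
-- Ratio of the fullest segment to the average; 1.0 means a perfectly
-- even distribution, and values well above 1.0 indicate skew.
SELECT max(cnt)::float8 / avg(cnt) AS skew_ratio
FROM (
  SELECT gp_segment_id, count(*) AS cnt
  FROM t1
  GROUP BY gp_segment_id
) AS per_segment;
```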

Solution:
If skew occurs, consider modifying the distribution key and redistributing data evenly:

=# ALTER TABLE <tablename> SET DISTRIBUTED BY (<newcolumn>);
=# ALTER TABLE <tablename> SET WITH (REORGANIZE=true);
=# ANALYZE <tablename>;

Attention!
Distribution key selection affects data distribution and query performance. Modifying it after business launch carries performance risks. Choose carefully during design. Refer to YMatrix DDL Best Practices.

2.3 Data Bloat

Symptoms:
Frequent updates/deletes without a proper cleanup cycle cause table bloat.
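
Before cleaning up, bloat can be confirmed. One option, assuming Greenplum's gp_toolkit schema is available in YMatrix, is the bloat diagnostic view:

```sql
-- Tables whose actual page count far exceeds the expected page count.
SELECT bdirelid::regclass AS table_name, bdidiag AS diagnosis
FROM gp_toolkit.gp_bloat_diag;
```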

Solutions:

  1. Drop historical (cold) partition subtables:

    =# DROP TABLE <partition_tablename>;
  2. Set up a reasonable cleanup cycle:

    =# VACUUM <tablename>;

    See Daily Cleanup.

  3. For severely bloated tables, reorganize:

    =# ALTER TABLE <tablename> SET WITH (REORGANIZE=true);
    =# ANALYZE <tablename>;

2.4 Query Plan Analysis

If none of the above causes the SQL performance bottleneck, analyze the query plan and ask:

  • Does a specific operation take too long?

    • Find the most time-consuming operation. For example, if an index scan is slower than expected, the index may be outdated and need rebuilding. Or adjust enable_<operator> parameters to force the optimizer to choose a different plan.
  • Are selective predicates applied early?

    • Apply the most selective filters early to reduce rows passed up the plan tree. If the optimizer misestimates selectivity, collect more stats on relevant columns or reorder the WHERE clause.
  • Is the join order optimal?

    • When joining multiple tables, ensure the optimizer chooses the most selective join order. If not, set join_collapse_limit=1 and use explicit JOIN syntax to force the order. Or collect more stats on join columns.
  • Does the optimizer selectively scan partitioned tables?

    • If using partitioning, ensure only relevant child tables are scanned. Parent table scans should return 0 rows.
  • Are hash aggregates and hash joins used?

    • Hash-based aggregations and joins are usually faster than sort-based alternatives because, when memory allows, they run entirely in memory without disk I/O.
  • Does enabling/disabling vectorization significantly affect performance?

    • The vectorized execution engine improves performance for sequential operations like expressions, filters, and aggregations. For random operations like sorting or hashing, the improvement is limited.
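
For instance, forcing a join order with explicit JOIN syntax (a, b, and c are hypothetical tables) looks like this:

```sql
-- With join_collapse_limit = 1, the planner keeps the JOIN order as written.
SET join_collapse_limit = 1;
EXPLAIN
SELECT *
FROM a
JOIN b ON a.id = b.id
JOIN c ON b.id = c.id;
```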

For more on query plans, see Understanding Query Plans.