YMatrix Architecture
This document introduces the overall architecture and related concepts of YMatrix.
Enterprises that stack many complex data technologies on top of one another face instability, high cost, low performance, and high latency, which makes it difficult to get full value from their data.
To reduce the complexity of the data ecosystem, YMatrix uses a simple, hyper-converged architecture that integrates computing, storage, and network resources into a unified system. It is built on a massively parallel processing (MPP) system and follows the principles of a microkernel architecture.
This architecture is flexible and adapts to multiple scenarios. It is well suited to IoT time-series scenarios, and it also supports traditional analytical data warehouse environments and business intelligence (BI) workloads.
What are the advantages of the YMatrix architecture compared to complex data technology stacks?
Replacing traditional data technology stacks with a hyper-converged architecture may seem like a daunting task. So why do we need to do this?
In fact, regardless of the scenario, fully adopting a hyper-converged architecture can benefit many businesses by providing a unified data foundation for their complex IT systems. This applies to fields such as smart connected vehicles, industrial internet, smart manufacturing, smart cities, energy, finance, and pharmaceuticals.
Compared to complex data technology stacks like the Hadoop ecosystem, the YMatrix architecture offers the following advantages:
- Hyper-convergence
- Robustness: A complex technology stack typically consists of N separate data processing systems. Assuming each component fails independently with probability P, the probability that the entire stack is healthy is approximately (1-P)^N, so each additional component reduces stability. With only one system (N = 1), the hyper-converged architecture avoids this compounding effect.
- Cost-effectiveness: Because it is hyper-converged, YMatrix can ingest and manage data within a single system, so data does not need to be transmitted to, or stored in, multiple distributed systems. This reduces hardware requirements, such as disks, and lowers storage costs.
- Timeliness: In a hyper-converged architecture, data does not need to be transmitted across multiple systems, resulting in low latency and high timeliness.
- Simplified management: The hyper-converged solution makes the entire data ecosystem easier to manage. It does not require expertise in multiple products or programming languages; SQL knowledge is enough to operate it.
- High availability
- When a small number of nodes fail, YMatrix's state data management service performs node failover automatically, without human intervention and transparently to the business. This reduces labor costs and the risk of human error.
- Rich toolchain ecosystem
- Compatible with the Postgres/Greenplum ecosystem. Covers various scenarios such as data migration, writing, performance testing, backup, and recovery.
- Supports standard SQL
- Supports the SQL:2016 standard, covering data types, scalar expressions, query expressions, character sets, data allocation rules, set operators, and more.
- Full support for ACID transactions
- Ensures data integrity and consistency, avoiding complex error checks and handling at the user level, and reducing your operational burden.
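The robustness estimate in the list above, (1-P)^N, can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `stack_availability` and the 1% failure probability are assumptions chosen for the example, and the formula itself assumes that components fail independently of one another.

```python
def stack_availability(p_fail: float, n_components: int) -> float:
    """Probability that all N components are healthy at once: (1 - P)^N.

    Assumes every component fails independently with the same probability P.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    return (1.0 - p_fail) ** n_components


# Hypothetical example: each system has a 1% chance of being down.
for n in (1, 3, 5, 10):
    print(f"N={n:2d}  availability={stack_availability(0.01, n):.4f}")
```

Under these assumptions, a single system is healthy 99% of the time, while a stack of ten such systems is fully healthy only about 90% of the time.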
Hyper-Converged Architecture Diagram
Compared to databases with other architectures, YMatrix's hyper-convergence shows in its integration of multiple data types and data operations, which enables high-performance support for multiple data types and multiple scenarios within a single database. Internally, YMatrix has a microkernel design: on top of the common core components, it provides different combinations of storage and execution engines to meet the needs of various business scenarios. Each combination forms a microkernel that delivers targeted improvements in write, storage, and query performance.
The following diagram illustrates the composition and functions of YMatrix's internal hyper-converged architecture:
[Figure: YMatrix hyper-converged architecture diagram]
The following sections provide a detailed overview of the components of the YMatrix hyper-converged architecture.
- Common Core Components
These primarily refer to shared resources within the database, such as memory management, network communication protocols, and basic data structures.
- Storage Engines and Execution Engines
These are the combinations of storage engines and execution engines that can be selected when creating tables in YMatrix under different scenarios. Each combination can form a microkernel.
- Optimizer
Converts an SQL string into a query plan and generates the best plan based on the capabilities provided by the selected underlying storage engine.
- Logging, Transactions, Concurrency, Lock Management, Snapshots
These are standard components within the YMatrix kernel that provide generic functionality such as concurrency control, transaction mechanisms, and fault recovery.
- SQL
This refers to the standard SQL interface between YMatrix and the client.
- Authentication, Roles, Auditing, Encryption, Monitoring, Backup, Recovery, High Availability
These are some other common database features supported by YMatrix.
Database Architecture Diagram
YMatrix's high-level database architecture is based on the classic massively parallel processing (MPP) database architecture, with some enhancements.
The following diagram describes the core components that make up a YMatrix database system and how they work together:
[Figure: YMatrix database architecture diagram]
The following sections provide a detailed introduction to the various components of the YMatrix database system and their functions.
- Master Node
- Responsible for establishing and managing session connections with clients.
- Responsible for parsing SQL statements and generating query plans.
- Distributes query plans to Segments, monitors query execution processes, and collects feedback results to return to clients.
- The Master does not store business data; it only stores the Data Dictionary, which is the collection of definitions and attributes for all data elements used in the system.
- A cluster has exactly one Master. The Master can be paired with a backup node, referred to as the Standby, in a primary-standby configuration.
- Data Nodes (Segments)
- Responsible for storing data and executing SQL queries in a distributed manner.
- The key to achieving optimal performance with YMatrix lies in distributing data and workloads evenly across a large number of Segment nodes with identical capabilities, so that all Segment nodes start working on a task at the same time and finish at about the same time.
- Client
- A generic term for any device or application capable of accessing the database.
- MatrixGate
- MatrixGate, abbreviated as mxgate, is YMatrix's high-speed streaming data write tool. For more information, see mxgate.
- Network Layer (Interconnect)
- The network layer of the database architecture. It covers inter-process communication between Segments and the network infrastructure that supports this communication.
- State Data Management Service (Cluster Service)
- The Cluster Service ensures high availability of the database by managing node state information. YMatrix implements this service with an etcd cluster: when a database node fails, the service reads the node state data stored in etcd, identifies a currently healthy node to act as the new primary node, and promotes it so that the cluster as a whole remains available.
For example, if the Master node fails, its Standby node is promoted to Master; if the Standby node fails, the overall cluster is not affected. Similarly, if a Primary node fails, its Mirror node is promoted to Primary; if the Mirror node fails, the overall cluster is not affected. For more information, see Failure Recovery.
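The promotion rules in the example above can be sketched as a small Python model. This is not YMatrix's or etcd's actual implementation: the `Node` class, the `fail_node` function, and the node names are hypothetical, and the sketch only encodes the two rules stated in the text (a failed Master or Primary is replaced by its healthy backup; a failed Standby or Mirror triggers no promotion).

```python
from dataclasses import dataclass
from typing import List, Optional

# Backup role -> acting role it takes over when promoted.
PROMOTES_TO = {"standby": "master", "mirror": "primary"}


@dataclass
class Node:
    name: str
    role: str            # "master", "standby", "primary", or "mirror"
    healthy: bool = True


def fail_node(pair: List[Node], failed_name: str) -> Optional[Node]:
    """Mark one node of a pair as failed and return the promoted node, if any.

    `pair` holds the two nodes of one Master/Standby or Primary/Mirror pair.
    """
    failed = next(n for n in pair if n.name == failed_name)
    failed.healthy = False
    if failed.role in PROMOTES_TO:
        # A backup (Standby or Mirror) failed: nothing needs to be promoted.
        return None
    # An acting node (Master or Primary) failed: promote its healthy backup.
    for node in pair:
        if node.healthy and node.role in PROMOTES_TO:
            node.role = PROMOTES_TO[node.role]
            return node
    return None  # no healthy backup is available


# Hypothetical Primary/Mirror pair for one Segment.
segment_pair = [Node("seg0", "primary"), Node("seg0_mirror", "mirror")]
promoted = fail_node(segment_pair, "seg0")
print(promoted.name, promoted.role)  # prints "seg0_mirror primary"
```

In a real cluster the health information would come from the state data kept by the Cluster Service rather than from an in-memory flag; the sketch only shows the decision logic.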