2021 - 5

mysql tens of millions of fuzzy query optimization: full text search

The first thing that comes to mind when using fuzzy query is Like

However, using Like %language% will invalidate the index. When the data volume reaches millions, the Like query efficiency will be very low.

The following is a tens-million-level database query:

mysql> select count(*) from book;
+----------+
| count(*) |
+----------+
| 10134213 |
+----------+
1 row in set (11.77 sec)
mysql> select * from book where name like "%script%";
+----------------+-----------------+------+------+ ------+--------+---------------------+---------------- ---------+
| isbn | name | type | hot | num | status | create_time | update_time |
+----------------+-----------------+------+------+ ------+--------+---------------------+---------------- ---------+
| 110-110-122-31 | Cao Yu's script selection | Fake | 0 | 5 | 0 | 2021-04-29 14:19:43 | 2021-04-29 14:19:43 |
+----------------+-----------------+------+------+ ------+--------+---------------------+---------------- ---------+
1 row in set (15.08 sec)

mysql> explain select * from book where name like "%script%";
+----+-------------+-------+------------+-------+-- -------------+------+---------+------+----------+- ---------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+-- -------------+------+---------+------+----------+- ---------+-------------+
| 1 | SIMPLE | book | NULL | ALL | NULL | NULL | NULL | NULL | 10355497 | 11.11 | Using where |
+----+-------------+-------+------------+-------+-- -------------+------+---------+------+----------+- ---------+-------------+
1 row in set, 1 warning (0.01 sec)

You can consider full-text search to optimize the query:

Before MySQL 5.7.6, full-text index only supported English full-text index and did not support Chinese full-text index. It was necessary to use a word segmenter to preprocess the Chinese paragraphs and split them into words, and then store them in the database.
Starting from MySQL 5.7.6, MySQL has a built-in ngram full-text parser to support Chinese, Japanese, and Korean word segmentation.

ngram

An ngram is a sequence of n consecutive words in a text.
The ngram full-text parser can segment text into words. Each word is a sequence of n consecutive words.
For example, use ngram full-text parser to segment "Gong Xi Fa Cai":

n=1: "Gong", "Xi", "Fa", "Wealth"
n=2: "Congratulations", "Hifa", "Get rich"
n=3: "Gong Xi Fa", "Xi Fa Cai"
n=4: "Congratulations on getting rich"

The global variable ngram_token_size is used in MySQL to configure the size of n in ngram. Its value range is 1 to 10, and the default value is 2. Usually ngram_token_size is set to the minimum number of characters of the word to be queried. If you need to search for words, set ngram_token_size to 1. With the default value of 2, no results will be obtained when searching for a single word. Because Chinese words have at least two Chinese characters, it is recommended to use the default value 2.

  • method one
    In modifying the MySQL configuration file my.ini, add a line of ngram_token_size parameter settings at the end:

    ngram_token_size=2
  • Method Two
    When using the startup command mysqld, pass the parameters as follows:

    mysqld --ngram_token_size=2

Create full-text index

alter table `book` add fulltext index fulltext_name(`name`) WITH PARSER ngram;

Inquire

mysql> select * from book where name like "%script%";
+----------------+----------+--------+------+---- --+--------+---------------------+---------------- -----+
| isbn | name | type | hot | num | status | create_time | update_time |
+----------------+----------+--------+------+---- --+--------+---------------------+---------------- -----+
| 110-120-119-15 | Lua Script | Programming | 4 | 5 | 0 | 2021-04-01 00:00:00 | 2021-04-01 00:00:00 |
+----------------+----------+--------+------+---- --+--------+---------------------+---------------- -----+
1 row in set (15.30 sec)

mysql> SELECT * FROM book WHERE MATCH (`name`) AGAINST ("script");
+----------------+----------+--------+------+---- --+--------+---------------------+---------------- -----+
| isbn | name | type | hot | num | status | create_time | update_time |
+----------------+----------+--------+------+---- --+--------+---------------------+---------------- -----+
| 110-120-119-15 | Lua Script | Programming | 4 | 5 | 0 | 2021-04-01 00:00:00 | 2021-04-01 00:00:00 |
+----------------+----------+--------+------+---- --+--------+---------------------+---------------- -----+
1 row in set (0.01 sec)

mysql> explain SELECT * FROM book WHERE MATCH (`name`) AGAINST ("script");
+----+-------------+------+------------+--------- -+---------------+----------+----------+-------+--- ---+----------+----------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------+------------+--------- -+---------------+----------+---------+-------+--- ---+----------+----------------------------------+
| 1 | SIMPLE | book | NULL | fulltext | ft_index | ft_index | 0 | const | 1 | 100.00 | Using where; Ft_hints: sorted |
+----+-------------+------+------------+--------- -+---------------+----------+---------+-------+--- ---+----------+----------------------------------+
1 row in set, 1 warning (0.01 sec)