[MySQL source code] Efficiency and flow analysis of MySQL DROP TABLE on compressed tables

We previously hit an incident where repeatedly dropping compressed tables ended in a long semaphore wait and a crash. The error in the production alert log was:

OS WAIT ARRAY INFO: reservation count 36647199, signal count 34050225
--Thread 1331538240 has waited at row0purge.c line 680 for 950.00 seconds the semaphore:
S-lock on RW-latch at 0xe60b60 '&dict_operation_lock'
a writer (thread id 1383319872) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file row0purge.c line 680
Last time write locked in file /home/jiyuan/rpmbuild/BUILD/tb-mysql-5.5.18/storage/innobase/row/row0mysql.c line 3213
Judging from the alert log, the DROP TABLE must have held dict_sys->dict_operation_lock for a long time:
row_drop_table_for_mysql()
{
    ......
    /* Serialize data dictionary operations with dictionary mutex:
    no deadlocks can occur then in these operations */

    if (trx->dict_operation_lock_mode != RW_X_LATCH) {
        /* Prevent foreign key checks etc. while we are dropping
        the table */

        row_mysql_lock_data_dictionary(trx);

        locked_dictionary = TRUE;
    }
    ......
}
From row_mysql_lock_data_dictionary we can see that, besides taking the X-latch on dict_operation_lock, the transaction also holds dict_sys->mutex.
Normally both are held until row_drop_table_for_mysql is about to return, when row_mysql_unlock_data_dictionary is called to release dict_sys->mutex and dict_operation_lock.
The purge thread, meanwhile, at row0purge.c line 680:
row_purge_parse_undo_rec()
{
    ......
    row_mysql_freeze_data_dictionary(trx);

    mutex_enter(&(dict_sys->mutex));
    ......
}
I then tried to reproduce the problem. I made a simple modification to sysbench so that it supports compressed tables; this is a trivial change. The patch below is against the Percona version of sysbench:
Index: tests/db/common.lua
===================================================================
--- tests/db/common.lua (revision 2462)
+++ tests/db/common.lua (revision 2467)
@@ -18,9 +18,15 @@
    end 

    i = table_id
+   
+   local comp_str=" "
+   if mysql_use_compress ~= 0 then
+     comp_str=" ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE="..mysql_use_compress
+   end

-   print("Creating table 'sbtest" .. i .. "'...")
-   if (db_driver == "mysql") then
+   print("Creating table 'sbtest" .. i)
+    
+    if (db_driver == "mysql") then
        query = [[
    CREATE TABLE sbtest]] .. i .. [[ (
           id INTEGER UNSIGNED NOT NULL ]] .. ((oltp_auto_inc and "AUTO_INCREMENT") or "") .. [[, 
@@ -28,7 +34,7 @@
           c CHAR(120) DEFAULT '' NOT NULL,
           pad CHAR(60) DEFAULT '' NOT NULL,
           ]] .. index_name .. [[ (id)
-          ) /*! ENGINE = ]] .. mysql_table_engine .. " MAX_ROWS = " .. myisam_max_rows .. " */"
+          ) /*! ENGINE = ]] .. mysql_table_engine ..comp_str.. " MAX_ROWS = " .. myisam_max_rows .. " */"

    elseif (db_driver == "drizzle") then
    query = [[
Index: tests/db/select_random_points.lua
===================================================================
--- tests/db/select_random_points.lua   (revision 2462)
+++ tests/db/select_random_points.lua   (revision 2467)
@@ -13,6 +13,11 @@
    db_connect()

    print("Creating table 'sbtest'...")
+   
+   local comp_str=" "
+   if mysql_use_compress ~= 0 then
+   comp_str=" ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE="..mysql_use_compress
+   end 

    if (db_driver == "mysql") then
       query = [[
@@ -22,7 +27,7 @@
           c CHAR(120) DEFAULT '' NOT NULL,
           pad CHAR(60) DEFAULT '' NOT NULL,
           PRIMARY KEY (id)
-        ) /*! ENGINE = ]] .. mysql_table_engine .. " MAX_ROWS = " .. myisam_max_rows .. " */"
+        ) /*! ENGINE = ]] .. mysql_table_engine ..comp_str.. " MAX_ROWS = " .. myisam_max_rows .. " */"

    elseif (db_driver == "oracle") then
       query = [[
Index: drivers/mysql/drv_mysql.c
===================================================================
--- drivers/mysql/drv_mysql.c   (revision 2462)
+++ drivers/mysql/drv_mysql.c   (revision 2467)
@@ -75,7 +75,8 @@
   {"mysql-ssl", "use SSL connections, if available in the client library", SB_ARG_TYPE_FLAG, "off"},
   {"myisam-max-rows", "max-rows parameter for MyISAM tables", SB_ARG_TYPE_INT, "1000000"},
   {"mysql-debug", "dump all client library calls", SB_ARG_TYPE_FLAG, "off"},
-  
+  {"mysql-use-compress", "whether to create a compress table or not, and this value equal to key_block_size", 
+  SB_ARG_TYPE_INT , "0"},
   {NULL, NULL, SB_ARG_TYPE_NULL, NULL}
 };
With this patch applied, we can use the new option --mysql-use-compress=4 to create sbtest tables with KEY_BLOCK_SIZE=4.
However, after many attempts I never fully reproduced the purge thread's dict_operation_lock wait timeout described above. Instead, the following semaphore wait showed up frequently:
--Thread 140431724488448 has waited at buf0flu.c line 1454 for 303.00 seconds the semaphore:
S-lock on RW-latch at 0x2bce158 '&buf_pool->page_hash_latch'
a writer (thread id 140431697413888) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0buf.c line 2482
We can see that during the whole DROP TABLE inside the InnoDB layer, the stack looks like this:
#0 0x00000000008a8c2d in buf_buddy_free_low ()
#1 0x0000000000815234 in buf_LRU_block_remove_hashed_page ()
#2 0x0000000000815fb3 in buf_LRU_free_block ()
#3 0x0000000000817143 in buf_LRU_search_and_free_block ()
#4 0x00000000008187bc in buf_LRU_get_free_block ()
#5 0x000000000080bf0e in buf_page_get_gen ()
#6 0x00000000007fdc58 in btr_search_drop_page_hash_when_freed ()
#7 0x000000000084c489 in fseg_free_extent ()
#8 0x000000000084d284 in fseg_free_step ()
#9 0x00000000007ec4a9 in btr_free_but_not_root ()
#10 0x000000000081d50e in dict_drop_index_tree ()
#11 0x00000000007b6035 in row_upd_clust_step ()
#12 0x00000000007b69de in row_upd_step ()
#13 0x0000000000899b88 in que_run_threads ()
#14 0x000000000089a286 in que_eval_sql ()
#15 0x00000000007a5ce9 in row_drop_table_for_mysql ()
#16 0x00000000007918ae in ha_innobase::delete_table(char const*) ()
or:
Thread 3 (Thread 0x7f2531467700 (LWP 24352)):
#0 0x00000000008a8c2d in buf_buddy_free_low ()
#1 0x00000000008150f4 in buf_LRU_block_remove_hashed_page ()
#2 0x0000000000817def in buf_LRU_invalidate_tablespace ()
#3 0x0000000000847cab in fil_delete_tablespace ()
#4 0x00000000007a606d in row_drop_table_for_mysql ()
#5 0x00000000007918ae in ha_innobase::delete_table(char const*) ()
#6 0x0000000000694c0e in ha_delete_table(THD*, handlerton*, char const*, char const*, char const*, bool) ()
#7 0x0000000000602e55 in mysql_rm_table_no_locks(THD*, TABLE_LIST*, bool, bool, bool, bool) ()
#8 0x00000000006034ad in mysql_rm_table(THD*, TABLE_LIST*, char, char) ()
#9 0x00000000005abd29 in mysql_execute_command(THD*) ()
#10 0x00000000005afaa8 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#11 0x00000000005b12c7 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#12 0x00000000006355ff in do_handle_one_connection(THD*) ()
#13 0x0000000000635700 in handle_one_connection ()
#14 0x0000000000900da6 in pfs_spawn_thread ()
#15 0x0000003888c077e1 in start_thread () from /lib64/libpthread.so.0
#16 0x00000038888e68ed in clone () from /lib64/libc.so.6
At this point SHOW ENGINE INNODB STATUS also hangs:
innodb_show_status->srv_printf_innodb_monitor->buf_print_io->buf_stats_get_pool_info
which needs buf_pool->LRU_list_mutex.
Thread state:
Thread 41 (Thread 0x7fb8d04dc700 (LWP 2554)):
#0  0x0000003888c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000884811 in os_event_wait_low ()
#2  0x00000000007c42cf in sync_array_wait_event ()
#3  0x00000000007c54a3 in mutex_spin_wait ()
#4  0x0000000000803b0e in pfs_mutex_enter_func ()
#5  0x0000000000807775 in buf_print_io ()
#6  0x00000000007bf89d in srv_printf_innodb_monitor.clone.3 ()
#7  0x00000000007bfef0 in srv_monitor_thread ()
#8  0x0000003888c077e1 in start_thread () from /lib64/libpthread.so.0
#9  0x00000038888e68ed in clone () from /lib64/libc.so.6
From the backtrace we can also see that the master thread is blocked acquiring an S-lock:
Thread 39 (Thread 0x7fb8cf0da700 (LWP 2556)):
#0  0x0000003888c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000884811 in os_event_wait_low ()
#2  0x00000000007c42cf in sync_array_wait_event ()
#3  0x00000000007c4510 in rw_lock_s_lock_spin ()
#4  0x000000000080fcac in buf_flush_page_and_try_neighbors ()
#5  0x0000000000810dbf in buf_flush_batch ()
#6  0x0000000000811540 in buf_flush_list ()
#7  0x00000000007bca2e in srv_master_thread ()
#8  0x0000003888c077e1 in start_thread () from /lib64/libpthread.so.0
#9  0x00000038888e68ed in clone () from /lib64/libc.so.6
While executing buf_flush_try_neighbors, the master thread needs to repeatedly acquire the S-lock on page_hash_latch:
1454         rw_lock_s_lock(&buf_pool->page_hash_latch);
So we have good reason to believe that if the DROP TABLE is slow enough, the whole server will inevitably hang, until it crashes after more than 900 seconds.

Possibly related: http://bugs.mysql.com/bug.php?id=59727

We note that versions after MySQL 5.5.18 fixed several issues related to inefficient DROP TABLE, so we also ran some simple tests here. The problem is not easy to reproduce; most of the time the drop completes quickly. This appears related to the state of the buffer pool, but it is still unclear how to reproduce it every time, so we resorted to dropping tables in a scripted loop.
table size: 1,000,000 records
Percona 5.5.18: nearly 1-2 minutes
Percona 5.5.28: very quick
Let's first walk through the InnoDB-level DROP TABLE flow in Percona 5.5.18.
The entry point is row_drop_table_for_mysql:
1. Acquire locks
    if (trx->dict_operation_lock_mode != RW_X_LATCH) {
        /* Prevent foreign key checks etc. while we are dropping the
        table */
        row_mysql_lock_data_dictionary(trx);
        locked_dictionary = TRUE;
    }
An ordinary DROP statement enters the if branch, taking the X-latch on dict_operation_lock and acquiring the mutex dict_sys->mutex,
and also setting trx->dict_operation_lock_mode = RW_X_LATCH.
The purpose is to serialize the subsequent operations on the data dictionary and on the system tables inside ibdata.
2. Load the table into the dict cache by name
    table = dict_table_get_low_ignore_err(
        name, DICT_ERR_IGNORE_INDEX_ROOT | DICT_ERR_IGNORE_CORRUPT);
The table is read either from dict_sys->table_hash (dict_table_check_if_in_cache_low) or from disk (dict_load_table).
3. Check foreign key constraints
4. If table->n_mysql_handles_opened > 0, some other thread has the table open; add it to the background drop list, then goto funct_exit:
      added = row_add_table_to_background_drop_list(table->name);
MySQL 5.5's MDL locking guarantees that no other thread is using the table during DROP TABLE. But when ALTER TABLE needs to drop a table, the table may be placed temporarily on row_mysql_drop_list, and the background master thread completes the drop once no queries remain on the table.
We will analyze this path separately later.
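The deferral idea can be sketched with a toy model. The names row_drop_table and background_drop_pass below are invented for illustration; only row_add_table_to_background_drop_list and row_mysql_drop_list exist in the server:

```c
/* Hypothetical sketch of the "background drop list": if another thread
   still has the table open, the drop is queued and retried later by the
   master thread. Not the real InnoDB structures. */
#include <string.h>

#define MAX_DROP_LIST 16

struct table { const char *name; int n_handles_opened; };

static const char *drop_list[MAX_DROP_LIST];
static int drop_list_len = 0;

/* returns 1 if the drop was deferred to the background list */
static int row_drop_table(struct table *t)
{
    if (t->n_handles_opened > 0) {
        drop_list[drop_list_len++] = t->name;      /* defer */
        return 1;
    }
    return 0;                                      /* dropped immediately */
}

/* what the master thread would do on each pass over the list */
static int background_drop_pass(struct table *t)
{
    int i = 0, dropped = 0;
    while (i < drop_list_len) {
        if (strcmp(drop_list[i], t->name) == 0 && t->n_handles_opened == 0) {
            drop_list[i] = drop_list[--drop_list_len]; /* remove entry */
            dropped++;
        } else {
            i++;
        }
    }
    return dropped;
}
```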
5. Remove all locks on the table
lock_remove_all_on_table(table, TRUE);
6. Invoke InnoDB's internal SQL parser to operate on the data dictionary
row_upd_clust_step -> ... -> dict_drop_index_tree // drop each index tree according to its record in SYS_INDEXES
>> Read the index root_page_no and space id stored in the record, and whether it is a compressed table (zip_size = fil_space_get_zip_size(space))
>> Free all pages except the root page:
btr_free_but_not_root(space, zip_size, root_page_no);

>>> Free the leaf page segment:

    finished = fseg_free_step(root + PAGE_HEADER + PAGE_BTR_SEG_LEAF, &mtr);

fseg_free_step can be called repeatedly to free the leaf pages one chunk at a time, which avoids an oversized mtr.

As the argument shows, root + PAGE_HEADER + PAGE_BTR_SEG_LEAF points to the leaf segment (each index has two segments: one managing the leaf pages and one managing the non-leaf pages).

--> Read an extent descriptor:

descr = xdes_get_descriptor(space, zip_size, header_page, mtr);

This reads page 0 of the space, which holds the space header (stored at offset FSP_HEADER_OFFSET within that page),

and then obtains the xdes_t via the space header. Here header_page is 3, i.e. the root page; the B-tree root page actually stores the segment headers of both segments.

inode = fseg_inode_try_get(header, space, zip_size, mtr); // read the inode of this segment

descr = fseg_get_first_extent(inode, space, zip_size, mtr);

This fetches the segment's first extent. The extent lists are consulted in the order FSEG_FULL -> FSEG_NOT_FULL -> FSEG_FREE.

// TODO: analyze the file-space management layer systematically

--> If descr is not NULL:

fseg_free_extent // free one extent

Iterate over the pages of this extent, removing each from the page hash index by calling:

            btr_search_drop_page_hash_when_freed(space, zip_size, first_page_in_extent + i);

|--> First obtain the corresponding block (buf_page_get_gen)

|--> Then call btr_search_drop_page_hash_index to remove the page's entries from the adaptive hash index

Performance bottleneck 1: this occurs in buf_page_get_gen. Its mode argument is BUF_PEEK_IF_IN_POOL, meaning NULL is returned if the page is not in the buffer pool. But if the page is in the pool and the table is compressed, the subsequent logic obtains a free block from the buffer pool (block = buf_LRU_get_free_block(buf_pool)), and this is likely where the time goes.

The corresponding extent node is then removed from the inode,

and finally fsp_free_extent puts the extent on the free list.

--> Finally FALSE is returned to the caller, which loops and calls the function again to continue freeing.

--> Free the fragment pages, the inode, and so on; these are all fast operations and never the bottleneck.

>>> If the previous call to fseg_free_step returned FALSE, the segment is not yet fully freed: goto leaf_loop and call fseg_free_step again.
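The leaf_loop calling pattern can be sketched with a toy model: each "step" frees at most one extent's worth of pages inside its own mini-transaction and reports whether the segment is empty, so the caller loops without any single mtr growing too large. The model below is ours; only the fseg_free_step name and the general pattern come from InnoDB.

```c
/* Toy model of the fseg_free_step() loop, NOT the on-disk segment code. */
#define PAGES_PER_EXTENT 64

struct segment { int pages_left; };

/* frees up to one extent per call; returns 1 when the segment is empty */
static int fseg_free_step_model(struct segment *seg, int *mtr_pages)
{
    int n = seg->pages_left < PAGES_PER_EXTENT ? seg->pages_left
                                               : PAGES_PER_EXTENT;
    *mtr_pages = n;            /* work done inside this mini-transaction */
    seg->pages_left -= n;
    return seg->pages_left == 0;
}

/* the leaf_loop pattern: keep stepping until finished; track the largest
   amount of work any single mini-transaction had to do */
static int free_segment(struct segment *seg, int *max_mtr)
{
    int finished, steps = 0;
    *max_mtr = 0;
    do {
        int mtr_pages;
        finished = fseg_free_step_model(seg, &mtr_pages);
        if (mtr_pages > *max_mtr)
            *max_mtr = mtr_pages;
        steps++;
    } while (!finished);
    return steps;
}
```

No matter how large the segment, each mini-transaction touches at most one extent's worth of pages.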

>>> Free the non-leaf segment (identified via PAGE_BTR_SEG_TOP):

    finished = fseg_free_step_not_header(root + PAGE_HEADER + PAGE_BTR_SEG_TOP, &mtr);

The flow is similar to fseg_free_step, except that the corresponding inode is not freed.

>> Free the root page:
btr_free_root(space, zip_size, root_page_no, mtr);  // free the B-tree root page
>> Set the DICT_SYS_INDEXES_PAGE_NO_FIELD of the index's SYS_INDEXES record to FIL_NULL:
    page_rec_write_field(rec, DICT_SYS_INDEXES_PAGE_NO_FIELD,
                 FIL_NULL, mtr);
7. fil_delete_tablespace // actually delete the tablespace
>> Check for pending I/O and pending insert-buffer merges; if there are any, loop and wait.
>> If innodb_lazy_drop_table is enabled, call buf_LRU_mark_space_was_deleted(id):
>>> Scan the buffer pool LRU for pages of the space and set bpage->space_was_being_deleted = TRUE.
>> If it is not enabled, call buf_LRU_invalidate_tablespace:

>>> buf_LRU_drop_page_hash_for_tablespace // drop all page hash index entries related to this tablespace id

This scans buf_pool->LRU with a batched approach: a temporary array of 1024 elements (page_arr) is allocated, and every page belonging to the table being dropped is appended to it; when the array fills up, buf_LRU_drop_page_hash_batch is called to drop the corresponding page hash index entries.
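A minimal sketch of that batching pattern (counters stand in for the real AHI maintenance; only the 1024-slot array and the batch-drop idea come from the server code, the helper names here are ours):

```c
/* Toy model of the batched page-hash drop in
   buf_LRU_drop_page_hash_for_tablespace(); counters replace the real
   adaptive-hash-index work. */
#define BUF_LRU_DROP_SEARCH_HASH_SIZE 1024

static unsigned page_arr[BUF_LRU_DROP_SEARCH_HASH_SIZE];
static int num_entries = 0;
static int batches_flushed = 0;
static int pages_dropped = 0;

/* stands in for buf_LRU_drop_page_hash_batch() */
static void drop_page_hash_batch(void)
{
    pages_dropped += num_entries;   /* pretend to drop each entry */
    num_entries = 0;
    batches_flushed++;
}

/* called once per LRU page that belongs to the space being dropped */
static void collect_page(unsigned page_no)
{
    page_arr[num_entries++] = page_no;
    if (num_entries == BUF_LRU_DROP_SEARCH_HASH_SIZE)
        drop_page_hash_batch();     /* array full: flush one batch */
}

/* after the LRU scan, drop whatever remains in the array */
static void finish_scan(void)
{
    if (num_entries > 0)
        drop_page_hash_batch();
}
```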

Question: dict_drop_index_tree already deleted these page hash index entries once; why do it again here?

>>>buf_LRU_invalidate_tablespace_buf_pool_instance

Judging from the backtrace, this is also where most of the time is spent: buf_pool->LRU is scanned once more, looking for pages whose space id matches.

--> If the page is pinned (buf_page_get_io_fix(bpage) != BUF_IO_NONE, or bpage->buf_fix_count > 0), move on to the next page.

--> Remove the page from the LRU:

buf_LRU_block_remove_hashed_page(bpage, TRUE)

|--> buf_LRU_remove_block(bpage);  // remove bpage from the LRU

|--> Look the page up in the page hash (buf_page_hash_get_low) and, after the necessary checks, remove it from page_hash.

|--> For an ordinary page, a memset suffices; for a compressed table, buf_buddy_free must also be called to release the compressed page's memory.

Performance bottleneck 2: from what we observed, buf_buddy_free is invoked frequently here over a long stretch of time.

We can also see that the LRU is scanned twice above. This is exactly the situation discussed in bug #51325, bug #64284, bug #59727 and others, and later versions changed it.
Moreover, the AHI entries are actually removed twice: as the analysis above shows, they were already removed once while freeing the B-tree extents.
After 5.5.23 these problems are largely solved. fil_delete_tablespace was optimized: on DROP TABLE, the new interface buf_LRU_flush_or_remove_pages -> buf_flush_dirty_pages evicts the dropped table's pages from the flush list without writing them out. Compared with the previous two LRU scans, scanning only the flush list brings the cost down from minutes to seconds.
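A rough model of that optimization, assuming a flush list whose entries are tagged with a space id (the _model suffix marks our invented name; only buf_LRU_flush_or_remove_pages and buf_flush_dirty_pages exist in the server): pages of the dropped space are evicted from the list without ever being written.

```c
/* Toy model of the 5.5.23-style flush-list eviction: walk only the
   (much shorter) flush list and remove dirty pages of the dropped
   space without flushing them to disk. */
#include <assert.h>

struct dirty_page { unsigned space_id; int written; };

/* returns how many pages were evicted; none of them are written out */
static int buf_flush_dirty_pages_model(struct dirty_page *flush_list,
                                       int *len, unsigned drop_space)
{
    int i = 0, evicted = 0;
    while (i < *len) {
        if (flush_list[i].space_id == drop_space) {
            assert(!flush_list[i].written);      /* evict, never flush */
            flush_list[i] = flush_list[--*len];  /* unordered remove   */
            evicted++;
        } else {
            i++;
        }
    }
    return evicted;
}
```

The win is purely in how much has to be scanned: the flush list holds only dirty pages, while the LRU holds every page in the buffer pool, and the old code walked the LRU twice.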

While reading the code, however, I found that Percona 5.5.28 still supports innodb_lazy_drop_table, which enters buf_LRU_mark_space_was_deleted, and there the AHI entries are still operated on. I will next try removing that and see what happens.

Original article; when reposting, please credit: reposted from Simple Life.
