From 65635cbc37e011e71b208257a25e7c1078cd039b Mon Sep 17 00:00:00 2001 From: Jun'ichi Nomura Date: Wed, 17 Oct 2012 17:45:36 +0900 Subject: blkcg: Fix use-after-free of q->root_blkg and q->root_rl.blkg blk_put_rl() does not call blkg_put() for q->root_rl because we don't take request list reference on q->root_blkg. However, if root_blkg is once attached then detached (freed), blk_put_rl() is confused by the bogus pointer in q->root_blkg. For example, with !CONFIG_BLK_DEV_THROTTLING && CONFIG_CFQ_GROUP_IOSCHED, switching IO scheduler from cfq to deadline will cause system stall after the following warning with 3.6: > WARNING: at /work/build/linux/block/blk-cgroup.h:250 > blk_put_rl+0x4d/0x95() > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf > ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1 > Call Trace: > [] warn_slowpath_common+0x85/0x9d > [] warn_slowpath_null+0x1a/0x1c > [] blk_put_rl+0x4d/0x95 > [] __blk_put_request+0xc3/0xcb > [] blk_finish_request+0x232/0x23f > [] ? blk_end_bidi_request+0x34/0x5d > [] blk_end_bidi_request+0x42/0x5d > [] blk_end_request+0x10/0x12 > [] scsi_io_completion+0x207/0x4d5 > [] scsi_finish_command+0xfa/0x103 > [] scsi_softirq_done+0xff/0x108 > [] blk_done_softirq+0x8d/0xa1 > [] ? > generic_smp_call_function_single_interrupt+0x9f/0xd7 > [] __do_softirq+0x102/0x213 > [] ? lock_release_holdtime+0xb6/0xbb > [] ? raise_softirq_irqoff+0x9/0x3d > [] call_softirq+0x1c/0x30 > [] do_softirq+0x4b/0xa3 > [] irq_exit+0x53/0xd5 > [] smp_call_function_single_interrupt+0x34/0x36 > [] call_function_single_interrupt+0x6f/0x80 > [] ? mwait_idle+0x94/0xcd > [] ? mwait_idle+0x8b/0xcd > [] cpu_idle+0xbb/0x114 > [] rest_init+0xc1/0xc8 > [] ? csum_partial_copy_generic+0x16c/0x16c > [] start_kernel+0x3d4/0x3e1 > [] ? kernel_init+0x1f7/0x1f7 > [] x86_64_start_reservations+0xb8/0xbd > [] x86_64_start_kernel+0x101/0x110 This patch clears q->root_blkg and q->root_rl.blkg when root blkg is destroyed. Signed-off-by: Jun'ichi Nomura Acked-by: Vivek Goyal Acked-by: Tejun Heo Cc: stable@kernel.org Signed-off-by: Jens Axboe --- block/blk-cgroup.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'block') diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index cafcd7431189..3ad5e3fbf579 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -285,6 +285,13 @@ static void blkg_destroy_all(struct request_queue *q) blkg_destroy(blkg); spin_unlock(&blkcg->lock); } + + /* + * root blkg is destroyed. Just clear the pointer since + * root_rl does not take reference on root blkg. + */ + q->root_blkg = NULL; + q->root_rl.blkg = NULL; } static void blkg_rcu_free(struct rcu_head *rcu_head) -- cgit v1.2.3 From 65c77fd9e8a1c8c3da0bbbea6b7efa3d6ef265f8 Mon Sep 17 00:00:00 2001 From: Jun'ichi Nomura Date: Mon, 22 Oct 2012 10:15:37 +0900 Subject: blkcg: stop iteration early if root_rl is the only request list __blk_queue_next_rl() finds next request list based on blkg_list while skipping root_blkg in the list. OTOH, root_rl is special as it may exist even without root_blkg. Though the later part of the function handles such a case correctly, exiting early is good for readability of the code. Signed-off-by: Jun'ichi Nomura Cc: Tejun Heo Acked-by: Vivek Goyal Signed-off-by: Jens Axboe --- block/blk-cgroup.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'block') diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 3ad5e3fbf579..d0b770391ad4 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -333,6 +333,9 @@ struct request_list *__blk_queue_next_rl(struct request_list *rl, */ if (rl == &q->root_rl) { ent = &q->blkg_list; + /* There are no more block groups, hence no request lists */ + if (list_empty(ent)) + return NULL; } else { blkg = container_of(rl, struct blkcg_gq, rl); ent = &blkg->q_node; -- cgit v1.2.3 From 8e42e0a23d30ba84d8e946042ee82aac4934048a Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Tue, 23 Oct 2012 13:01:46 -0700 Subject: block: remove CONFIG_EXPERIMENTAL This config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it. CC: Jens Axboe Signed-off-by: Kees Cook Signed-off-by: Jens Axboe --- block/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'block') diff --git a/block/Kconfig b/block/Kconfig index 09acf1b39905..a7e40a7c8214 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -89,7 +89,7 @@ config BLK_DEV_INTEGRITY config BLK_DEV_THROTTLING bool "Block layer bio throttling support" - depends on BLK_CGROUP=y && EXPERIMENTAL + depends on BLK_CGROUP=y default n ---help--- Block layer bio throttling support. It can be used to limit -- cgit v1.2.3 From 975927b942c932bd839ed07e5d40b4037d816844 Mon Sep 17 00:00:00 2001 From: Jianpeng Ma Date: Thu, 25 Oct 2012 21:58:17 +0200 Subject: block: Add blk_rq_pos(rq) to sort rq when plushing My workload is a raid5 which had 16 disks. And used our filesystem to write using direct-io mode. I used the blktrace to find those message: 8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5] 8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5] 8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5] 8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5] 8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5] 8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5] 8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5] 8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5] 8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5] 8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5] 8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5] 8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5] 8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5] 8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5] 8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5] 8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5] 8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5] 8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5] 8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5] 8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5] 8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5] 8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5] 8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5] 8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5] 8,16 0 0 2.453853661 0 m N cfq2579 insert_request 8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5] 8,16 0 0 2.453854439 0 m N cfq2579 insert_request 8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2 8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1 8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert 8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request 8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1 8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5] 8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1 8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert 8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request 8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2 8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5] 8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0] 8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0 8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0] 8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0 8,16 0 0 2.454795160 0 m N cfq schedule dispatch From above messages,we can find rq[W 7493144 + 104] and rq[W 7493120 + 24] do not merge. Because the bio order is: 8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5] 8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5] 8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5] 8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5] The bio(7493144) first and bio(7493120) later.So the subsequent bios will be divided into two parts. When flushing plug-list,because elv_attempt_insert_merge only support backmerge,not supporting frontmerge. So rq[7493120 + 24] can't merge with rq[7493144 + 104]. From my test,i found those situation can count 25% in our system. Using this patch, there is no this situation. Signed-off-by: Jianpeng Ma CC:Shaohua Li Signed-off-by: Jens Axboe --- block/blk-core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'block') diff --git a/block/blk-core.c b/block/blk-core.c index a33870b1847b..3c95c4d6e31a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2868,7 +2868,8 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b) struct request *rqa = container_of(a, struct request, queuelist); struct request *rqb = container_of(b, struct request, queuelist); - return !(rqa->q <= rqb->q); + return !(rqa->q < rqb->q || + (rqa->q == rqb->q && blk_rq_pos(rqa) < blk_rq_pos(rqb))); } /* -- cgit v1.2.3