path: root/block
Age  Commit message  (Author)
2015-03-09  Merge tag 'v3.14.35' into v3.14-rt  (Anders Roxell)
This is the 3.14.35 stable release
2015-03-06  blk-throttle: check stats_cpu before reading it from sysfs  (Thadeu Lima de Souza Cascardo)
commit 045c47ca306acf30c740c285a77a4b4bda6be7c5 upstream. When reading blkio.throttle.io_serviced in a recently created blkio cgroup, it's possible to race against the creation of a throttle policy, which delays the allocation of stats_cpu. Like other functions in the throttle code, just checking for a NULL stats_cpu prevents the following oops caused by that race. [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020 [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1] [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4 [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5 [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000 [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980 [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300 Not tainted (3.19.0) [ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR: 42008884 XER: 20000000 [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0 GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000 GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000 GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808 GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000 GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80 GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850 [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180 [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180 [ 1137.734943] Call Trace: [ 1137.734952] [c000000f1d213780] 
[d000000005560520] 0xd000000005560520 (unreliable) [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0 [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70 [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150 [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50 [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510 [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200 [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80 [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0 [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100 [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98 [ 1137.735383] Instruction dump: [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24 [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018 And here is one code that allows to easily reproduce this, although this has first been found by running docker. 
    void run(pid_t pid)
    {
        int n;
        int status;
        int fd;
        char *buffer;

        buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
        n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
        fd = open(CGPATH "/test/tasks", O_WRONLY);
        write(fd, buffer, n);
        close(fd);
        if (fork() > 0) {
            fd = open("/dev/sda", O_RDONLY | O_DIRECT);
            read(fd, buffer, 512);
            close(fd);
            wait(&status);
        } else {
            fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
            n = read(fd, buffer, BUFFER_SIZE);
            close(fd);
        }
        free(buffer);
        exit(0);
    }

    void test(void)
    {
        int status;

        mkdir(CGPATH "/test", 0666);
        if (fork() > 0)
            wait(&status);
        else
            run(getpid());
        rmdir(CGPATH "/test");
    }

    int main(int argc, char **argv)
    {
        int i;

        for (i = 0; i < NR_TESTS; i++)
            test();
        return 0;
    }

Reported-by: Ricardo Marin Matinata <rmm@br.ibm.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
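The guard this commit describes (treat a still-NULL `stats_cpu` as "nothing to report" instead of dereferencing it) can be sketched in userspace C. This is only an illustration: `struct tg_stats` and `sum_serviced()` are hypothetical stand-ins, not the kernel's throttle-group structures or `tg_prfill_cpu_rwstat()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a throttle group whose per-CPU counters
 * are allocated lazily, some time after the group itself is created. */
struct tg_stats {
    uint64_t *stats_cpu;   /* NULL until the delayed allocation has run */
    size_t nr_cpus;
};

/* Sum the per-CPU counters, reporting 0 if they do not exist yet. */
uint64_t sum_serviced(const struct tg_stats *tg)
{
    uint64_t sum = 0;

    assert(tg != NULL);
    if (tg->stats_cpu == NULL)   /* reader raced with the allocation */
        return 0;

    for (size_t cpu = 0; cpu < tg->nr_cpus; cpu++)
        sum += tg->stats_cpu[cpu];
    return sum;
}
```

A reader that arrives before the allocation simply sees a zero total, which is presumably an acceptable answer for a group that has not serviced any I/O yet.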
2015-03-06  cfq-iosched: fix incorrect filing of rt async cfqq  (Jeff Moyer)
commit c6ce194325cef342313e3d27620411ce90a89c50 upstream.

Hi,

If you can manage to submit an async write as the first async I/O from the context of a process with realtime scheduling priority, then a cfq_queue is allocated, but filed into the wrong async_cfqq bucket. It ends up in the best effort array, but actually has realtime I/O scheduling priority set in cfqq->ioprio.

The reason is that cfq_get_queue assumes the default scheduling class and priority when there is no information present (i.e. when the async cfqq is created):

    static struct cfq_queue *
    cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
                  struct bio *bio, gfp_t gfp_mask)
    {
        const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
        const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);

cic->ioprio starts out as 0, which is "invalid". So, class of 0 (IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so:

    async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);

    static struct cfq_queue **
    cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
    {
        switch (ioprio_class) {
        case IOPRIO_CLASS_RT:
            return &cfqd->async_cfqq[0][ioprio];
        case IOPRIO_CLASS_NONE:
            ioprio = IOPRIO_NORM;
            /* fall through */
        case IOPRIO_CLASS_BE:
            return &cfqd->async_cfqq[1][ioprio];
        case IOPRIO_CLASS_IDLE:
            return &cfqd->async_idle_cfqq;
        default:
            BUG();
        }
    }

Here, instead of returning a class mapped from the process' scheduling priority, we get back the bucket associated with IOPRIO_CLASS_BE. Now, there is no queue allocated there yet, so we create it:

    cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask);

That function ends up doing this:

    cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
    cfq_init_prio_data(cfqq, cic);

cfq_init_cfqq marks the priority as having changed. Then, cfq_init_prio_data does this:

    ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
    switch (ioprio_class) {
    default:
        printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class);
    case IOPRIO_CLASS_NONE:
        /*
         * no prio set, inherit CPU scheduling settings
         */
        cfqq->ioprio = task_nice_ioprio(tsk);
        cfqq->ioprio_class = task_nice_ioclass(tsk);
        break;

So we basically have two code paths that treat IOPRIO_CLASS_NONE differently, which results in an RT async cfqq filed into a best effort bucket.

Attached is a patch which fixes the problem. I'm not sure how to make it cleaner. Suggestions would be welcome.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
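The mismatch between the two code paths can be modelled in a few lines of standalone C. The message does not show the patch itself, so the sketch below is only one plausible way to make the paths agree: resolve "no class set" from the task before picking a bucket. The names `bucket_for_class()` and `resolve_class()` and the `task_is_realtime` flag are hypothetical; only the class values 0 to 3 mirror the kernel's IOPRIO_CLASS_* constants.

```c
#include <assert.h>

/* Same numeric values as the kernel's IOPRIO_CLASS_* constants. */
enum ioprio_class { CLASS_NONE = 0, CLASS_RT = 1, CLASS_BE = 2, CLASS_IDLE = 3 };

/* Bucket selection as quoted above: 0 = RT array, 1 = BE array,
 * 2 = idle queue. CLASS_NONE silently lands in the BE array. */
int bucket_for_class(enum ioprio_class cls)
{
    switch (cls) {
    case CLASS_RT:
        return 0;
    case CLASS_NONE:   /* falls through to best effort */
    case CLASS_BE:
        return 1;
    case CLASS_IDLE:
        return 2;
    }
    return -1;
}

/* Hypothetical helper: turn "no class set" into the class the queue
 * will later inherit from the task's CPU scheduling settings. */
enum ioprio_class resolve_class(enum ioprio_class cls, int task_is_realtime)
{
    assert(cls >= CLASS_NONE && cls <= CLASS_IDLE);
    if (cls != CLASS_NONE)
        return cls;
    return task_is_realtime ? CLASS_RT : CLASS_BE;
}
```

Calling `bucket_for_class(CLASS_NONE)` for a realtime task reproduces the misfiling (bucket 1), while `bucket_for_class(resolve_class(CLASS_NONE, 1))` lands in the RT bucket, consistent with the class that `cfq_init_prio_data()` would assign afterwards.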
2015-03-06  cfq-iosched: handle failure of cfq group allocation  (Konstantin Khlebnikov)
commit 69abaffec7d47a083739b79e3066cb3730eba72e upstream. Cfq_lookup_create_cfqg() allocates struct blkcg_gq using GFP_ATOMIC. In cfq_find_alloc_queue() possible allocation failure is not handled. As a result kernel oopses on NULL pointer dereference when cfq_link_cfqq_cfqg() calls cfqg_get() for NULL pointer. Bug was introduced in v3.5 in commit cd1604fab4f9 ("blkcg: factor out blkio_group creation"). Prior to that commit cfq group lookup had returned pointer to root group as fallback. This patch handles this error using existing fallback oom_cfqq. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Fixes: cd1604fab4f9 ("blkcg: factor out blkio_group creation") Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-20  Merge tag 'v3.14.29' into v3.14-rt  (Anders Roxell)
This is the 3.14.29 stable release
2015-01-16  genhd: check for int overflow in disk_expand_part_tbl()  (Jens Axboe)
commit 5fabcb4c33fe11c7e3afdf805fde26c1a54d0953 upstream. We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition() with a user passed in partno value. If we pass in 0x7fffffff, the new target in disk_expand_part_tbl() overflows the 'int' and we access beyond the end of ptbl->part[] and even write to it when we do the rcu_assign_pointer() to assign the new partition. Reported-by: David Ramos <daramos@stanford.edu> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
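The overflow is easy to see in isolation: the table must hold `partno + 1` entries, and that sum does not fit in an `int` when `partno` is 0x7fffffff. Below is a minimal sketch of a guard, assuming a hypothetical helper `part_tbl_target()`. It tests before adding because signed overflow is undefined in ISO C; the kernel patch itself may phrase the check differently.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: number of table slots needed to hold index
 * `partno`, or -1 if partno is negative or partno + 1 overflows int. */
int part_tbl_target(int partno)
{
    int target;

    if (partno < 0 || partno == INT_MAX)
        return -1;

    target = partno + 1;
    assert(target > 0);   /* cannot have wrapped after the check above */
    return target;
}
```

With this guard a user-supplied `INT_MAX` is rejected up front, so the table is never sized from a wrapped value.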
2015-01-16  blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map  (Jens Axboe)
commit a33c1ba2913802b6fb23e974bb2f6a4e73c8b7ce upstream. We currently use num_possible_cpus(), but that breaks on sparc64 where the CPU ID space is discontig. Use nr_cpu_ids as the highest CPU ID instead, so we don't end up reading from invalid memory. Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
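Why the count of possible CPUs is the wrong size for a map indexed by CPU ID can be shown with a toy mask. The two functions below are hypothetical userspace analogues of `num_possible_cpus()` and `nr_cpu_ids`, and the sparse ID layout is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Analogue of num_possible_cpus(): how many IDs are marked possible. */
size_t count_possible(const unsigned char *possible, size_t len)
{
    size_t n = 0;

    assert(possible != NULL);
    for (size_t id = 0; id < len; id++)
        if (possible[id])
            n++;
    return n;
}

/* Analogue of nr_cpu_ids: highest possible ID plus one. */
size_t highest_id_plus_one(const unsigned char *possible, size_t len)
{
    size_t top = 0;

    assert(possible != NULL);
    for (size_t id = 0; id < len; id++)
        if (possible[id])
            top = id + 1;
    return top;
}
```

With possible IDs {0, 1, 4, 5} the count is 4, yet a map indexed by CPU ID needs 6 slots. Sizing it by the count would place IDs 4 and 5 outside the array.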
2014-11-17  Merge tag 'v3.14.24' into v3.14-rt  (Anders Roxell)
This is the 3.14.24 stable release

Signed-off-by: Anders Roxell <anders.roxell@linaro.org>

Conflicts:
    net/ipv4/ip_output.c
2014-11-14  scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND  (Jan Kara)
commit 84ce0f0e94ac97217398b3b69c21c7a62ebeed05 upstream. When sg_scsi_ioctl() fails to prepare request to submit in blk_rq_map_kern() we jump to a label where we just end up copying (luckily zeroed-out) kernel buffer to userspace instead of reporting error. Fix the problem by jumping to the right label. CC: Jens Axboe <axboe@kernel.dk> CC: linux-scsi@vger.kernel.org Coverity-id: 1226871 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixed up the, now unused, out label. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-14  block: fix alignment_offset math that assumes io_min is a power-of-2  (Mike Snitzer)
commit b8839b8c55f3fdd60dc36abcda7e0266aff7985c upstream. The math in both blk_stack_limits() and queue_limit_alignment_offset() assume that a block device's io_min (aka minimum_io_size) is always a power-of-2. Fix the math such that it works for non-power-of-2 io_min. This issue (of alignment_offset != 0) became apparent when testing dm-thinp with a thinp blocksize that matches a RAID6 stripesize of 1280K. Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data block size") unlocked the potential for alignment_offset != 0 due to the dm-thin-pool's io_min possibly being a non-power-of-2. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
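The power-of-2 assumption amounts to computing a remainder with a bit mask. A small comparison, using hypothetical helper names, shows where the mask goes wrong for the 1280K block size mentioned above:

```c
#include <assert.h>
#include <stdint.h>

/* Remainder via mask: only correct when io_min is a power of 2. */
uint64_t misalign_mask(uint64_t offset, uint64_t io_min)
{
    assert(io_min > 0);
    return offset & (io_min - 1);
}

/* Remainder via modulo: correct for any non-zero io_min. */
uint64_t misalign_mod(uint64_t offset, uint64_t io_min)
{
    assert(io_min > 0);
    return offset % io_min;
}
```

For `io_min = 1310720` (1280 KiB) and an offset of exactly one block, the modulo form returns 0 while the mask form returns 1048576. For a power-of-2 size such as 4096 the two agree.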
2014-10-07  Merge tag 'v3.14.20' into v3.14-rt  (Anders Roxell)
This is the 3.14.20 stable release
2014-10-05  partitions: aix.c: off by one bug  (Dan Carpenter)
commit d97a86c170b4e432f76db072a827fe30b4d6f659 upstream. The lvip[] array has "state->limit" elements so the condition here should be >= instead of >. Fixes: 6ceea22bbbc8 ('partitions: add aix lvm partition support files') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Philippe De Muyter <phdm@macqel.be> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
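The off-by-one is the classic one: an array with `limit` elements has valid indices 0 through `limit - 1`, so the rejecting condition must be `>=`. A minimal sketch follows; `store_lv()` is a hypothetical function, not the code in aix.c.

```c
#include <assert.h>
#include <stddef.h>

/* Store `value` at `idx` in an array of `limit` elements.
 * Returns 0 on success, -1 if idx is out of range. */
int store_lv(int *lvip, size_t limit, size_t idx, int value)
{
    assert(lvip != NULL);
    if (idx >= limit)   /* `idx > limit` would let idx == limit through */
        return -1;
    lvip[idx] = value;
    return 0;
}
```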
2014-10-05  genhd: fix leftover might_sleep() in blk_free_devt()  (Jens Axboe)
commit 46f341ffcfb5d8530f7d1e60f3be06cce6661b62 upstream. Commit 2da78092 changed the locking from a mutex to a spinlock, so we no longer sleep in this context. But there was a leftover might_sleep() in there, which now triggers since we do the final free from an RCU callback. Get rid of it. Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05  block: Fix dev_t minor allocation lifetime  (Keith Busch)
commit 2da78092dda13f1efd26edbbf99a567776913750 upstream. Releases the dev_t minor when all references are closed to prevent another device from acquiring the same major/minor. Since the partition's release may be invoked from call_rcu's soft-irq context, the ext_dev_idr's mutex had to be replaced with a spinlock so as not to sleep. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05  cfq-iosched: Fix wrong children_weight calculation  (Toshiaki Makita)
commit e15693ef18e13e3e6bffe891fe140f18b8ff6d07 upstream.

cfq_group_service_tree_add() is applying new_weight at the beginning of the function via cfq_update_group_weight(). This actually allows weight to change between adding it to and subtracting it from children_weight, and triggers WARN_ON_ONCE() in cfq_group_service_tree_del(), or even causes oops by divide error during vfr calculation in cfq_group_service_tree_add().

The detailed scenario is as follows:
1. Create blkio cgroup X and cgroup Y as a child of X. Set X's weight to 500 and perform some I/O to apply new_weight. This X's I/O completes before starting Y's I/O.
2. Y starts I/O and cfq_group_service_tree_add() is called with Y.
3. cfq_group_service_tree_add() walks up the tree during children_weight calculation and adds parent X's weight (500) to children_weight of root. children_weight becomes 500.
4. Set X's weight to 1000.
5. X starts I/O and cfq_group_service_tree_add() is called with X.
6. cfq_group_service_tree_add() applies its new_weight (1000).
7. I/O of Y completes and cfq_group_service_tree_del() is called with Y.
8. I/O of X completes and cfq_group_service_tree_del() is called with X.
9. cfq_group_service_tree_del() subtracts X's weight (1000) from children_weight of root. children_weight becomes -500. This triggers WARN_ON_ONCE().
10. Set X's weight to 500.
11. X starts I/O and cfq_group_service_tree_add() is called with X.
12. cfq_group_service_tree_add() applies its new_weight (500) and adds it to children_weight of root. children_weight becomes 0. Calculation of vfr triggers oops by divide error.

weight should be updated right before adding it to children_weight.

Reported-by: Ruki Sekiya <sekiya.ruki@lab.ntt.co.jp>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
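The arithmetic in the scenario above can be replayed with a deliberately simplified two-level model: one parent sum and one child. All names are hypothetical, and the real code walks a whole hierarchy. The only point illustrated is the ordering: applying the pending weight while the old one is still counted makes the later subtraction disagree with the earlier addition.

```c
#include <assert.h>
#include <stddef.h>

struct parent { long children_weight; };
struct child  { long weight; long new_weight; int counted; };

/* Buggy ordering: the pending weight is applied on entry, even when
 * the child's old weight is already part of the parent's sum. */
void child_add_buggy(struct parent *p, struct child *c)
{
    assert(p != NULL && c != NULL);
    c->weight = c->new_weight;
    if (c->counted)
        return;
    p->children_weight += c->weight;
    c->counted = 1;
}

/* Fixed ordering: update the weight right before adding it. */
void child_add_fixed(struct parent *p, struct child *c)
{
    assert(p != NULL && c != NULL);
    if (c->counted)
        return;
    c->weight = c->new_weight;
    p->children_weight += c->weight;
    c->counted = 1;
}

/* Remove the child's current weight from the parent's sum. */
void child_del(struct parent *p, struct child *c)
{
    assert(p != NULL && c != NULL);
    if (!c->counted)
        return;
    p->children_weight -= c->weight;
    c->counted = 0;
}
```

Adding a child at weight 500, changing `new_weight` to 1000, calling the buggy add again and then deleting leaves the sum at -500, matching step 9. The same sequence through `child_add_fixed()` returns it to 0.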
2014-08-01  Merge tag 'v3.14.15' into v3.14-rt  (Anders Roxell)
This is the 3.14.15 stable release
2014-07-31  blkcg: don't call into policy draining if root_blkg is already gone  (Tejun Heo)
commit 0b462c89e31f7eb6789713437eb551833ee16ff3 upstream. While a queue is being destroyed, all the blkgs are destroyed and its ->root_blkg pointer is set to NULL. If someone else starts to drain while the queue is in this state, the following oops happens. NULL pointer dereference at 0000000000000028 IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 PGD e4a1067 PUD b773067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched] CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000 RIP: 0010:[<ffffffff8144e944>] [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001 R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450 R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28 FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0 Stack: ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80 ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58 ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450 Call Trace: [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60 [<ffffffff81427641>] __blk_drain_queue+0x71/0x180 [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0 [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120 [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50 [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40 [<ffffffff8142d476>] blk_release_queue+0x26/0xd0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 
[<ffffffff81427505>] blk_put_queue+0x15/0x20 [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0 [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0 [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20 [<ffffffff817930e2>] device_release+0x32/0xa0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff817934d7>] put_device+0x17/0x20 [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0 [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40 [<ffffffff817d1257>] sdev_store_delete+0x27/0x30 [<ffffffff81792ca8>] dev_attr_store+0x18/0x30 [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50 [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0 [<ffffffff811f69bd>] SyS_write+0x4d/0xc0 [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b 776687bce42b ("block, blk-mq: draining can't be skipped even if bypass_depth was non-zero") made it easier to trigger this bug by making blk_queue_bypass_start() drain even when it loses the first bypass test to blk_cleanup_queue(); however, the bug has always been there even before the commit as blk_queue_bypass_start() could race against queue destruction, win the initial bypass test but perform the actual draining after blk_cleanup_queue() already destroyed all blkgs. Fix it by skippping calling into policy draining if all the blkgs are already gone. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Shirish Pargaonkar <spargaonkar@suse.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Reported-by: Jet Chen <jet.chen@intel.com> Tested-by: Shirish Pargaonkar <spargaonkar@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-31  block: don't assume last put of shared tags is for the host  (Christoph Hellwig)
commit d45b3279a5a2252cafcd665bbf2db8c9b31ef783 upstream. There is no inherent reason why the last put of a tag structure must be the one for the Scsi_Host, as device model objects can be held for arbitrary periods. Merge blk_free_tags and __blk_free_tags into a single function that just releases a reference, and get rid of the BUG() when the host reference wasn't the last. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-31  block: provide compat ioctl for BLKZEROOUT  (Mikulas Patocka)
commit 3b3a1814d1703027f9867d0f5cbbfaf6c7482474 upstream. This patch provides the compat BLKZEROOUT ioctl. The argument is a pointer to two uint64_t values, so there is no need to translate it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-10  Merge tag 'v3.14.12' into v3.14-rt  (Anders Roxell)
This is the 3.14.12 stable release
2014-07-09  blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t  (Tejun Heo)
commit a5049a8ae34950249a7ae94c385d7c5c98914412 upstream.

Hello,

So, this patch should do. Joe, Vivek, can one of you guys please verify that the oops goes away with this patch?

Jens, the original thread can be read at

  http://thread.gmane.org/gmane.linux.kernel/1720729

The fix converts blkg->refcnt from int to atomic_t. It does add some overhead but it should be minute compared to everything else which is going on and the involved cacheline bouncing, so I think it's highly unlikely to cause any noticeable difference. Also, the refcnt in question should be converted to a percpu_ref for blk-mq anyway, so the atomic_t is likely to go away pretty soon anyway.

Thanks.

------- 8< -------

__blkg_release_rcu() may be invoked after the associated request_queue is released with an RCU grace period in between. As such, the function and callbacks invoked from it must not dereference the associated request_queue. This is clearly indicated in the comment above the function.

Unfortunately, while trying to fix a different issue, 2a4fd070ee85 ("blkcg: move bulk of blkcg_gq release operations to the RCU callback") ignored this and added [un]locking of @blkg->q->queue_lock to __blkg_release_rcu(). This of course can cause oops as the request_queue may be long gone by the time this code gets executed.
general protection fault: 0000 [#1] SMP CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1 Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013 task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000 RIP: 0010:[<ffffffff8162e9e5>] [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086 RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000 RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39 R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130 R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0 Stack: ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000 ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0 ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30 Call Trace: [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150 [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300 [<ffffffff81091d81>] kthread+0xe1/0x100 [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0 Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5 +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f +b7 RIP [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP <ffff88085403fdf0> The request_queue locking was added because blkcg_gq->refcnt is an int protected with the queue lock and __blkg_release_rcu() needs to put the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and dropping queue locking in the function. Given the general heavy weight of the current request_queue and blkcg operations, this is unlikely to cause any noticeable overhead. 
Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in the near future, so whatever (most likely negligible) overhead it may add is temporary. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
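The idea of replacing a lock-protected `int` with an atomic counter can be sketched in portable C11. This is a userspace illustration using `<stdatomic.h>`, not the kernel's `atomic_t` API, and `struct obj` with its helpers is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object: no external lock is needed to take
 * or drop a reference, so a release path never has to touch a lock
 * that lives in another object which may already be gone. */
struct obj {
    atomic_int refcnt;
};

void obj_init(struct obj *o)
{
    atomic_init(&o->refcnt, 1);
}

void obj_get(struct obj *o)
{
    int old = atomic_fetch_add(&o->refcnt, 1);

    assert(old > 0);   /* taking a reference on a dead object is a bug */
    (void)old;
}

/* Returns true when the caller dropped the last reference and is
 * therefore responsible for freeing the object. */
bool obj_put(struct obj *o)
{
    int old = atomic_fetch_sub(&o->refcnt, 1);

    assert(old > 0);
    return old == 1;
}
```

Because get and put are single atomic operations, putting the parent from an RCU callback no longer depends on the queue lock being valid at that point.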
2014-07-08  Merge tag 'v3.14.10' into v3.14-rt  (Anders Roxell)
This is the 3.14.10 stable release

Signed-off-by: Anders Roxell <anders.roxell@linaro.org>

Conflicts:
    include/linux/irqdesc.h
    include/linux/thread_info.h
    kernel/irq/manage.c
    kernel/locking/rtmutex.c
2014-06-30  Changes since v3.14.2-rt2  (Anders Roxell)
- rwsem readers are now not allowed to nest. A patch from Steven Rostedt.
- a few bugs were fixed in the hotplug code which were made during the v3.14 port. Fixed by Mike Galbraith.
- Mike Galbraith sent a patch which might fix lazy preempt on x86_64. Patch applied and my machine still explodes therefore lazy preempt remains off on x86_64.
- Mike Galbraith sent a few patches to get cpu hotplug to work. This includes lg_global_trylock_relax().
- A few push downs of migrate_disable() (where we call migrate_disable() after the rt_mutex_trylock()) have been reverted. It seems hotplug is not too happy about this. A patch by Steven Rostedt and Mike Galbraith.
- There was a complaint about a backtrace from run_local_timers() in UP mode because a spin_try_lock() failed. _This_ particular case was not an error. This optimization was for FULL_NO_HZ which is pointless on UP because there is no spare CPU. Therefore, this optimization is disabled in UP mode and the backtrace is gone. Reported by Stanislav Meduna.
- block-mq notifier now uses a spinlock and runs during CPU_POST_DEAD instead of at CPU_DEAD time. lockdep complained about the sleeping ctx->lock within the rawlock (blk_mq_cpu_notify_lock) and CPU_DEAD runs with irqs off.

Known issues:
- bcache is disabled.
- lazy preempt on x86_64 leads to a crash with some load.
- CPU hotplug works in general. Steven's test script however deadlocks usually on the second invocation.

Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
2014-06-30  blk-mq: revert raw locks, post pone notifier to POST_DEAD  (Sebastian Andrzej Siewior)
The blk_mq_cpu_notify_lock should be raw because some CPU down levels are called with interrupts off. The notifier itself calls currently one function that is blk_mq_hctx_notify(). That function acquires the ctx->lock lock which is sleeping and I would prefer to keep it that way. That function only moves IO-requests from the CPU that is going offline to another CPU and it is currently the only one. Therefore I revert the list lock back to sleeping spinlocks and let the notifier run at POST_DEAD time. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
2014-06-30  softirq: Check preemption after reenabling interrupts  (Thomas Gleixner)
raise_softirq_irqoff() disables interrupts and wakes the softirq daemon, but after reenabling interrupts there is no preemption check, so the execution of the softirq thread might be delayed arbitrarily. In principle we could add that check to local_irq_enable/restore, but that's overkill as the raise_softirq_irqoff() sections are the only ones which show this behaviour. Reported-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
2014-06-30  block: Use cpu_chill() for retry loops  (Thomas Gleixner)
Retry loops on RT might loop forever when the modifying side was preempted. Steven also observed a live lock when there was a concurrent priority boosting going on. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
2014-06-30  block: mq: use cpu_light()  (Sebastian Andrzej Siewior)
There is a might-sleep splat because get_cpu() disables preemption and later we grab a lock. As a workaround for this we use get_cpu_light() and an additional lock to prevent taking the same ctx. There is a lock member in the ctx already, but there are some functions which do ++ on the member; this works with irqs off, but on RT we would need the extra lock. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
2014-06-30  fs-block-rt-support.patch  (Thomas Gleixner)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
2014-06-30  block: Shorten interrupt disabled regions  (Thomas Gleixner)
Moving the blk_sched_flush_plug() call out of the interrupt/preempt disabled region in the scheduler allows us to replace local_irq_save/restore(flags) by local_irq_disable/enable() in blk_flush_plug().

Now instead of doing this we disable interrupts explicitly when we lock the request_queue and reenable them when we drop the lock. That allows interrupts to be handled when the plug list contains requests for more than one queue.

Aside from that, this change makes the scope of the irq disabled region more obvious. The current code confused the hell out of me when looking at:

    local_irq_save(flags);
    spin_lock(q->queue_lock);
    ...
    queue_unplugged(q...);
      scsi_request_fn();
        spin_unlock(q->queue_lock);
        spin_lock(shost->host_lock);
        spin_unlock_irq(shost->host_lock);

    -------------------^^^ ????

        spin_lock_irq(q->queue_lock);
        spin_unlock(q->lock);
    local_irq_restore(flags);

Also add a comment to __blk_run_queue() documenting that q->request_fn() can drop q->queue_lock and reenable interrupts, but must return with q->queue_lock held and interrupts disabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110622174919.025446432@linutronix.de
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
2014-05-31  blktrace: fix accounting of partially completed requests  (Roman Pen)
commit af5040da01ef980670b3741b3e10733ee3e33566 upstream.

trace_block_rq_complete does not take into account that request can be partially completed, so we can get the following incorrect output of blkparser:

    C R 232 + 240 [0]
    C R 240 + 232 [0]
    C R 248 + 224 [0]
    C R 256 + 216 [0]

but should be:

    C R 232 + 8 [0]
    C R 240 + 8 [0]
    C R 248 + 8 [0]
    C R 256 + 8 [0]

Also, the whole output summary statistics of completed requests and final throughput will be incorrect.

This patch takes into account real completion size of the request and fixes wrong completion accounting.

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
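The two outputs differ only in which size gets reported per completion event: the sectors still left in the request (wrong) or the sectors finished in this step (right). A small model, with a hypothetical `struct rq` and `rq_complete()`, makes the accounting concrete:

```c
#include <assert.h>

/* Hypothetical request: current start sector and sectors remaining. */
struct rq {
    unsigned long long sector;
    unsigned int nr_sectors;
};

/* Complete `done` sectors from the front of the request. Stores the
 * start sector and the size to report for this completion event and
 * returns the number of sectors still pending. */
unsigned int rq_complete(struct rq *rq, unsigned int done,
                         unsigned long long *start, unsigned int *reported)
{
    assert(done <= rq->nr_sectors);
    *start = rq->sector;
    *reported = done;   /* not rq->nr_sectors, which is the remainder */
    rq->sector += done;
    rq->nr_sectors -= done;
    return rq->nr_sectors;
}
```

Starting from sector 232 with 240 sectors and completing 8 at a time yields `232 + 8`, `240 + 8`, and so on. Reporting `rq->nr_sectors` before the subtraction would give the `232 + 240`, `240 + 232` sequence shown as incorrect above.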
2014-03-20  block: free q->flush_rq in blk_init_allocated_queue error paths  (Dave Jones)
Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but neglected to free that allocation on the error paths that follow. Signed-off-by: Dave Jones <davej@fedoraproject.org> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-08  block: change flush sequence list addition back to front add  (Mike Snitzer)
Commit 18741986 inadvertently changed the rq flush insertion from a head to a tail insertion. Fix that back up. Signed-off-by: Mike Snitzer <msnitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-03-08  block: fix q->flush_rq NULL pointer crash on dm-mpath flush  (Mike Snitzer)
Commit 1874198 ("blk-mq: rework flush sequencing logic") switched ->flush_rq from being an embedded member of the request_queue structure to being dynamically allocated in blk_init_queue_node(). Request-based DM multipath doesn't use blk_init_queue_node(), instead it uses blk_alloc_queue_node() + blk_init_allocated_queue(). Because commit 1874198 placed the dynamic allocation of ->flush_rq in blk_init_queue_node() any flush issued to a dm-mpath device would crash with a NULL pointer, e.g.: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0 PGD bb3c7067 PUD bb01d067 PMD 0 Oops: 0002 [#1] SMP ... CPU: 5 PID: 5028 Comm: dt Tainted: G W O 3.14.0-rc3.snitm+ #10 ... task: ffff88032fb270e0 ti: ffff880079564000 task.ti: ffff880079564000 RIP: 0010:[<ffffffff8125037e>] [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0 RSP: 0018:ffff880079565c98 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000030 RDX: ffff880260c74048 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff880079565ca8 R08: ffff880260aa1e98 R09: 0000000000000001 R10: ffff88032fa78500 R11: 0000000000000246 R12: 0000000000000000 R13: ffff880260aa1de8 R14: 0000000000000650 R15: 0000000000000000 FS: 00007f8d36a2a700(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000079b36000 CR4: 00000000000007e0 Stack: 0000000000000000 ffff880260c74048 ffff880079565cd8 ffffffff81257a47 ffff880260aa1de8 ffff880260c74048 0000000000000001 0000000000000000 ffff880079565d08 ffffffff81257c2d 0000000000000000 ffff880260aa1de8 Call Trace: [<ffffffff81257a47>] blk_flush_complete_seq+0x2d7/0x2e0 [<ffffffff81257c2d>] blk_insert_flush+0x1dd/0x210 [<ffffffff8124ec59>] __elv_add_request+0x1f9/0x320 [<ffffffff81250681>] ? 
blk_account_io_start+0x111/0x190 [<ffffffff81253a4b>] blk_queue_bio+0x25b/0x330 [<ffffffffa0020bf5>] dm_request+0x35/0x40 [dm_mod] [<ffffffff812530c0>] generic_make_request+0xc0/0x100 [<ffffffff81253173>] submit_bio+0x73/0x140 [<ffffffff811becdd>] submit_bio_wait+0x5d/0x80 [<ffffffff81257528>] blkdev_issue_flush+0x78/0xa0 [<ffffffff811c1f6f>] blkdev_fsync+0x3f/0x60 [<ffffffff811b7fde>] vfs_fsync_range+0x1e/0x20 [<ffffffff811b7ffc>] vfs_fsync+0x1c/0x20 [<ffffffff811b81f1>] do_fsync+0x41/0x80 [<ffffffff8118874e>] ? SyS_lseek+0x7e/0x80 [<ffffffff811b8260>] SyS_fsync+0x10/0x20 [<ffffffff8154c2d2>] system_call_fastpath+0x16/0x1b Fix this by moving the ->flush_rq allocation from blk_init_queue_node() to blk_init_allocated_queue(). blk_init_queue_node() also calls blk_init_allocated_queue() so this change is functionally equivalent for all blk_init_queue_node() callers. Reported-by: Hannes Reinecke <hare@suse.de> Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
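The shape of the fix can be sketched in userspace C. The structures and function names below (`alloc_queue`, `init_allocated_queue`, `init_queue`, `free_queue`) are simplified, hypothetical stand-ins for the kernel functions named in the commit message, not their real definitions. The sketch shows why placing the allocation in the shared second stage covers both the one-call path and the two-step path that dm-mpath uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real kernel structures are far larger. */
struct request { int tag; };
struct request_queue { struct request *flush_rq; };

/* First stage: allocates only the queue itself (flush_rq stays NULL). */
struct request_queue *alloc_queue(void)
{
    return calloc(1, sizeof(struct request_queue));
}

/*
 * Shared second stage: every queue passes through here, so this is
 * where the flush request is allocated after the fix.
 */
int init_allocated_queue(struct request_queue *q)
{
    assert(q);
    q->flush_rq = calloc(1, sizeof(*q->flush_rq));
    return q->flush_rq ? 0 : -1;
}

/* One-call path: allocate, then run the shared second stage. */
struct request_queue *init_queue(void)
{
    struct request_queue *q = alloc_queue();
    if (q && init_allocated_queue(q) != 0) {
        free(q);
        return NULL;
    }
    return q;
}

void free_queue(struct request_queue *q)
{
    if (q) {
        free(q->flush_rq);
        free(q);
    }
}
```

A caller that uses the two-step path gets a valid `flush_rq` as soon as it calls `init_allocated_queue()`, and callers of `init_queue()` see no change in behavior.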
2014-03-07blk-mq: add REQ_SYNC earlyShaohua Li
Add REQ_SYNC early, so rq_dispatched[] in blk_mq_rq_ctx_init is set correctly. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-03-03rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlockMike Galbraith
[ 365.164040] BUG: sleeping function called from invalid context at kernel/rtmutex.c:674 [ 365.164041] in_atomic(): 1, irqs_disabled(): 1, pid: 26, name: migration/1 [ 365.164043] no locks held by migration/1/26. [ 365.164044] irq event stamp: 6648 [ 365.164056] hardirqs last enabled at (6647): [<ffffffff8153d377>] restore_args+0x0/0x30 [ 365.164062] hardirqs last disabled at (6648): [<ffffffff810ed98d>] multi_cpu_stop+0x9d/0x120 [ 365.164070] softirqs last enabled at (0): [<ffffffff810543bc>] copy_process.part.28+0x6fc/0x1920 [ 365.164072] softirqs last disabled at (0): [< (null)>] (null) [ 365.164076] CPU: 1 PID: 26 Comm: migration/1 Tainted: GF N 3.12.12-rt19-0.gcb6c4a2-rt #3 [ 365.164078] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.S013.032920111005 03/29/2011 [ 365.164091] 0000000000000001 ffff880a42ea7c30 ffffffff815367e6 ffffffff81a086c0 [ 365.164099] ffff880a42ea7c40 ffffffff8108919c ffff880a42ea7c60 ffffffff8153c24f [ 365.164107] ffff880a42ea91f0 00000000ffffffe1 ffff880a42ea7c88 ffffffff81297ec0 [ 365.164108] Call Trace: [ 365.164119] [<ffffffff810060b1>] try_stack_unwind+0x191/0x1a0 [ 365.164127] [<ffffffff81004872>] dump_trace+0x92/0x360 [ 365.164133] [<ffffffff81006108>] show_trace_log_lvl+0x48/0x60 [ 365.164138] [<ffffffff81004c18>] show_stack_log_lvl+0xd8/0x1d0 [ 365.164143] [<ffffffff81006160>] show_stack+0x20/0x50 [ 365.164153] [<ffffffff815367e6>] dump_stack+0x54/0x9a [ 365.164163] [<ffffffff8108919c>] __might_sleep+0xfc/0x140 [ 365.164173] [<ffffffff8153c24f>] rt_spin_lock+0x1f/0x70 [ 365.164182] [<ffffffff81297ec0>] blk_mq_main_cpu_notify+0x20/0x70 [ 365.164191] [<ffffffff81540a1c>] notifier_call_chain+0x4c/0x70 [ 365.164201] [<ffffffff81083499>] __raw_notifier_call_chain+0x9/0x10 [ 365.164207] [<ffffffff810567be>] cpu_notify+0x1e/0x40 [ 365.164217] [<ffffffff81525da2>] take_cpu_down+0x22/0x40 [ 365.164223] [<ffffffff810ed9c6>] multi_cpu_stop+0xd6/0x120 [ 365.164229] [<ffffffff810edd97>] cpu_stopper_thread+0xd7/0x1e0 [ 
365.164235] [<ffffffff810863a3>] smpboot_thread_fn+0x203/0x380 [ 365.164241] [<ffffffff8107cbf8>] kthread+0xc8/0xd0 [ 365.164250] [<ffffffff8154440c>] ret_from_fork+0x7c/0xb0 [ 365.164429] smpboot: CPU 1 is now offline Signed-off-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-21blk-mq: support partial I/O completionsChristoph Hellwig
Add a new blk_mq_end_io_partial function to partially complete requests as needed by the SCSI layer. We do this by reusing blk_update_request to advance the bio instead of having a simplified version of it in the blk-mq code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-21blk-mq: merge blk_mq_insert_request and blk_mq_run_requestChristoph Hellwig
blk_mq_run_request is almost identical to blk_mq_insert_request, so fold the two into one slightly more generic function by making the flush special case a bit smarter. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-21blk-mq: remove blk_mq_alloc_rqChristoph Hellwig
There's only one caller, which is a straight wrapper and fits the naming scheme of the related functions a lot better. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-14Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO fixes from Jens Axboe: "Second round of updates and fixes for 3.14-rc2. Most of this stuff has been queued up for a while. The notable exception is the blk-mq changes, which are naturally a bit more in flux still. The pull request contains: - Two bug fixes for the new immutable vecs, causing crashes with raid or swap. From Kent. - Various blk-mq tweaks and fixes from Christoph. A fix for integrity bio's from Nic. - A few bcache fixes from Kent and Darrick Wong. - xen-blk{front,back} fixes from David Vrabel, Matt Rushton, Nicolas Swenson, and Roger Pau Monne. - Fix for a vec miscount with integrity vectors from Martin. - Minor annotations or fixes from Masanari Iida and Rashika Kheria. - Tweak to null_blk to do more normal FIFO processing of requests from Shlomo Pongratz. - Elevator switching bypass fix from Tejun. - Softlockup in blkdev_issue_discard() fix when !CONFIG_PREEMPT from me" * 'for-linus' of git://git.kernel.dk/linux-block: (31 commits) block: add cond_resched() to potentially long running ioctl discard loop xen-blkback: init persistent_purge_work work_struct blk-mq: pair blk_mq_start_request / blk_mq_requeue_request blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq block: Fix cloning of discard/write same bios block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show blk-mq: rework flush sequencing logic null_blk: use blk_complete_request and blk_mq_complete_request virtio_blk: use blk_mq_complete_request blk-mq: rework I/O completions fs: Add prototype declaration to appropriate header file include/linux/bio.h fs: Mark function as static in fs/bio-integrity.c block/null_blk: Fix completion processing from LIFO to FIFO block: Explicitly handle discard/write same segments block: Fix nr_vecs for inline integrity vectors blk-mq: Add bio_integrity setup to blk_mq_make_request blk-mq: initialize sg_reserved_size blk-mq: handle dma_drain_size blk-mq: divert __blk_put_request for MQ ops blk-mq: support at_head 
inserations for blk_execute_rq ...
2014-02-12block: add cond_resched() to potentially long running ioctl discard loopJens Axboe
When mkfs issues a full device discard and the device only supports discards of a smallish size, we can loop in blkdev_issue_discard() for a long time. If preempt isn't enabled, this can turn into a soft lockup and the kernel will start complaining. Add an explicit cond_resched() at the end of the loop to avoid that. Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
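The loop pattern can be sketched in userspace C. This is an illustration only: `discard_range` and `cond_resched_sketch` are hypothetical names, and `sched_yield()` is used as a rough stand-in for the kernel's `cond_resched()`. The point is that a large range split into many small chunks gives up the CPU once per iteration.

```c
#include <assert.h>
#include <sched.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's cond_resched(). */
void cond_resched_sketch(void)
{
    sched_yield();
}

/*
 * Split nr_sects sectors into chunks of at most max_discard sectors.
 * Returns the number of chunks that would have been issued.
 */
uint64_t discard_range(uint64_t nr_sects, uint64_t max_discard)
{
    uint64_t issued = 0;

    assert(max_discard > 0);
    while (nr_sects) {
        uint64_t req = nr_sects < max_discard ? nr_sects : max_discard;

        /* A real implementation would build and submit a discard here. */
        nr_sects -= req;
        issued++;

        /* Yield at the end of every iteration so the loop cannot hog the CPU. */
        cond_resched_sketch();
    }
    return issued;
}
```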
2014-02-11blk-mq: pair blk_mq_start_request / blk_mq_requeue_requestChristoph Hellwig
Make sure we have a proper pairing between starting and requeueing requests. Move the dma drain and REQ_END setup into blk_mq_start_request, and make sure blk_mq_requeue_request properly undoes them, giving us a pair of functions to prepare and unprepare a request without leaving side effects. Together this ensures we always clean up properly after BLK_MQ_RQ_QUEUE_BUSY returns from ->queue_rq. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
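The prepare/unprepare pairing can be sketched as follows. The structure and the functions `start_request` and `requeue_request` are hypothetical simplifications, and modeling the dma drain as one extra segment is an assumption made for illustration. What the sketch demonstrates is the invariant: after a start followed by a requeue, the request is back in its original state and can be started again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified request state. */
struct request_sketch {
    unsigned int nr_segments;
    bool started;
    bool drained;   /* true while the extra drain segment is accounted */
};

/* Prepare: mark the request started and account the drain segment. */
void start_request(struct request_sketch *rq, bool needs_drain)
{
    assert(!rq->started);
    rq->started = true;
    if (needs_drain) {
        rq->nr_segments++;
        rq->drained = true;
    }
}

/* Unprepare: undo exactly what start_request() did. */
void requeue_request(struct request_sketch *rq)
{
    assert(rq->started);
    if (rq->drained) {
        rq->nr_segments--;
        rq->drained = false;
    }
    rq->started = false;
}
```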
2014-02-11blk-mq: dont assume rq->errors is set when returning an error from ->queue_rqChristoph Hellwig
rq->errors has never been part of the communication protocol between drivers and the block stack and most drivers will not have initialized it. Return -EIO to upper layers unconditionally when the driver returns BLK_MQ_RQ_QUEUE_ERROR. If a driver wants to return a different error it can easily do so by returning success after calling blk_mq_end_io itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
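A minimal sketch of this rule, using hypothetical names (`queue_status`, `rq_sketch`, `handle_queue_rq`) rather than the real blk-mq types: on an error status the dispatcher completes the request with -EIO and never looks at the driver-owned `errors` field, which may hold anything.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the ->queue_rq return codes. */
enum queue_status { QUEUE_OK, QUEUE_BUSY, QUEUE_ERROR };

struct rq_sketch {
    int errors;     /* driver-private, possibly uninitialized garbage */
    int completed;  /* set once the request has been ended */
    int result;     /* error code reported to upper layers */
};

/* Dispatch-side handling of the status returned by the driver. */
void handle_queue_rq(struct rq_sketch *rq, enum queue_status st)
{
    assert(rq);
    switch (st) {
    case QUEUE_OK:
        /* The driver owns the request and completes it itself. */
        break;
    case QUEUE_BUSY:
        /* The request would be requeued; nothing is reported yet. */
        break;
    case QUEUE_ERROR:
        /* Always -EIO; rq->errors is deliberately ignored. */
        rq->completed = 1;
        rq->result = -EIO;
        break;
    }
}
```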
2014-02-10block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_showMasanari Iida
cppcheck detected the following format string mismatch. [blk-mq-tag.c:201]: (warning) %u in format string (no. 1) requires 'unsigned int' but the argument type is 'int'. Change "cpu" from int to unsigned int, because the cpu number can never be negative. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
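The class of warning can be illustrated with a small standalone example. The function name and the format string below are hypothetical and are not copied from blk-mq-tag.c; the sketch only shows a `%u` conversion paired with an `unsigned int` argument, which is the matching type the fix establishes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one per-CPU line. Both arguments are unsigned int, so they
 * match the %u conversions; passing a plain int here is what the
 * checker flags.
 */
int show_cpu_line(char *buf, size_t len, unsigned int cpu, unsigned int nr_free)
{
    assert(buf && len > 0);
    return snprintf(buf, len, "  cpu%02u: nr_free=%u\n", cpu, nr_free);
}
```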
2014-02-10blk-mq: rework flush sequencing logicChristoph Hellwig
Switch to using a preallocated flush_rq for blk-mq similar to what's done with the old request path. This allows us to set up the request properly with a tag from the actually allowed range and ->rq_disk as needed by some drivers. To make life easier we also switch to dynamic allocation of ->flush_rq for the old path. This effectively reverts most of "blk-mq: fix for flush deadlock" and "blk-mq: Don't reserve a tag for flush request". Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-10blk-mq: rework I/O completionsChristoph Hellwig
Rework I/O completions to work more like the old code path. blk_mq_end_io now stays out of the business of deferring completions to other CPUs and calling blk_mark_rq_complete. The latter is very important to allow completing requests that have timed out and thus are already marked completed; the former allows using the IPI callout even for driver specific completions instead of having to reimplement them. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-07block: Explicitly handle discard/write same segmentsKent Overstreet
Immutable biovecs changed the way biovecs are interpreted - drivers no longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for using part of an existing segment without modifying it). This breaks with discards and write_same bios, since for those bi_size has nothing to do with segments in the biovec. So for now, we need a fairly gross hack - we fortunately know that there will never be more than one segment for the entire request, so we can special case discard/write_same. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
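The special case can be sketched as below. The `bio_sketch` structure and `count_segments` are hypothetical simplifications: real segment counting walks the biovec and applies queue limits, whereas this sketch derives the count from a byte size and a fixed segment size purely to show why discard and write-same requests need to bypass the size-based calculation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a bio. */
struct bio_sketch {
    bool is_discard;
    bool is_write_same;
    unsigned int size;      /* bytes covered by the bio */
    unsigned int seg_size;  /* bytes per segment in this toy model */
};

unsigned int count_segments(const struct bio_sketch *bio)
{
    /*
     * For discard and write-same the size describes the affected range,
     * not the payload, so it says nothing about segments. There is at
     * most one segment for the whole request.
     */
    if (bio->is_discard || bio->is_write_same)
        return 1;

    assert(bio->seg_size > 0);
    return (bio->size + bio->seg_size - 1) / bio->seg_size;
}
```

Without the early return, a 1 MiB discard in this model would be reported as 256 segments of 4 KiB even though it carries no such payload.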
2014-02-07blk-mq: Add bio_integrity setup to blk_mq_make_requestNicholas Bellinger
This patch adds the missing bio_integrity_enabled() + bio_integrity_prep() setup into blk_mq_make_request() in order to use DIF protection with scsi-mq. Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-07blk-mq: initialize sg_reserved_sizeChristoph Hellwig
To behave the same way as the old request path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-07blk-mq: handle dma_drain_sizeChristoph Hellwig
Make blk-mq handle the dma_drain_size field the same way as the old request path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-07blk-mq: divert __blk_put_request for MQ opsChristoph Hellwig
__blk_put_request needs to call into the blk-mq code just like blk_put_request. As we don't have the queue lock in this case both end up calling the same function. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>