aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2012-08-15mm: hugetlbfs: close race during teardown of hugetlbfs shared page tablesMel Gorman
commit d833352a4338dc31295ed832a30c9ccff5c7a183 upstream. If a process creates a large hugetlbfs mapping that is eligible for page table sharing and forks heavily with children some of whom fault and others which destroy the mapping then it is possible for page tables to get corrupted. Some teardowns of the mapping encounter a "bad pmd" and output a message to the kernel log. The final teardown will trigger a BUG_ON in mm/filemap.c. This was reproduced in 3.4 but is known to have existed for a long time and goes back at least as far as 2.6.37. It was probably was introduced in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages look like this; [ ..........] Lots of bad pmd messages followed by this [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7). [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7). [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7). [ 127.186778] ------------[ cut here ]------------ [ 127.186781] kernel BUG at mm/filemap.c:134! [ 127.186782] invalid opcode: 0000 [#1] SMP [ 127.186783] CPU 7 [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod [ 127.186801] [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR [ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002 [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0 [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00 [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003 [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8 [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8 [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000 [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0 [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0) [ 127.186821] Stack: [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98 [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000 [ 127.186827] Call Trace: [ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80 [ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220 [ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30 [ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0 [ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0 [ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50 [ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130 [ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0 [ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230 [ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150 [ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30 [ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80 [ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360 [ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170 [ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0 [ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 [ 127.186870] RSP <ffff8804144b5c08> [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]--- The bug is a race and not always easy to reproduce. To reproduce it I was doing the following on a single socket I7-based machine with 16G of RAM. $ hugeadm --pool-pages-max DEFAULT:13G $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall $ for i in `seq 1 9000`; do ./hugetlbfs-test; done On my particular machine, it usually triggers within 10 minutes but enabling debug options can change the timing such that it never hits. Once the bug is triggered, the machine is in trouble and needs to be rebooted. The machine will respond but processes accessing proc like "ps aux" will hang due to the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b. The basic problem is a race between page table sharing and teardown. For the most part page table sharing depends on i_mmap_mutex. In some cases, it is also taking the mm->page_table_lock for the PTE updates but with shared page tables, it is the i_mmap_mutex that is more important. Unfortunately it appears to be also insufficient. Consider the following situation Process A Process B --------- --------- hugetlb_fault shmdt LockWrite(mmap_sem) do_munmap unmap_region unmap_vmas unmap_single_vma unmap_hugepage_range Lock(i_mmap_mutex) Lock(mm->page_table_lock) huge_pmd_unshare/unmap tables <--- (1) Unlock(mm->page_table_lock) Unlock(i_mmap_mutex) huge_pte_alloc ... Lock(i_mmap_mutex) ... vma_prio_walk, find svma, spte ... Lock(mm->page_table_lock) ... share spte ... Unlock(mm->page_table_lock) ... Unlock(i_mmap_mutex) ... hugetlb_no_page <--- (2) free_pgtables unlink_file_vma hugetlb_free_pgd_range remove_vma_list In this scenario, it is possible for Process A to share page tables with Process B that is trying to tear them down. The i_mmap_mutex on its own does not prevent Process A walking Process B's page tables. At (1) above, the page tables are not shared yet so it unmaps the PMDs. Process A sets up page table sharing and at (2) faults a new entry. Process B then trips up on it in free_pgtables. This patch fixes the problem by adding a new function __unmap_hugepage_range_final that is only called when the VMA is about to be destroyed. This function clears VM_MAYSHARE during unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA ineligible for sharing and avoids the race. Superficially this looks like it would then be vunerable to truncate and madvise issues but hugetlbfs has its own truncate handlers so does not use unmap_mapping_range() and does not support madvise(DONTNEED). This should be treated as a -stable candidate if it is merged. Test program is as follows. The test case was mostly written by Michal Hocko with a few minor changes to reproduce this bug. ==== CUT HERE ==== static size_t huge_page_size = (2UL << 20); static size_t nr_huge_page_A = 512; static size_t nr_huge_page_B = 5632; unsigned int get_random(unsigned int max) { struct timeval tv; gettimeofday(&tv, NULL); srandom(tv.tv_usec); return random() % max; } static void play(void *addr, size_t size) { unsigned char *start = addr, *end = start + size, *a; start += get_random(size/2); /* we could itterate on huge pages but let's give it more time. */ for (a = start; a < end; a += 4096) *a = 0; } int main(int argc, char **argv) { key_t key = IPC_PRIVATE; size_t sizeA = nr_huge_page_A * huge_page_size; size_t sizeB = nr_huge_page_B * huge_page_size; int shmidA, shmidB; void *addrA = NULL, *addrB = NULL; int nr_children = 300, n = 0; if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { perror("shmget:"); return 1; } if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) { perror("shmat"); return 1; } if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { perror("shmget:"); return 1; } if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) { perror("shmat"); return 1; } fork_child: switch(fork()) { case 0: switch (n%3) { case 0: play(addrA, sizeA); break; case 1: play(addrB, sizeB); break; case 2: break; } break; case -1: perror("fork:"); break; default: if (++n < nr_children) goto fork_child; play(addrA, sizeA); break; } shmdt(addrA); shmdt(addrB); do { wait(NULL); } while (--n > 0); shmctl(shmidA, IPC_RMID, NULL); shmctl(shmidB, IPC_RMID, NULL); return 0; } [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build] Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-15mm: mmu_notifier: fix freed page still mapped in secondary MMUXiao Guangrong
commit 3ad3d901bbcfb15a5e4690e55350db0899095a68 upstream. mmu_notifier_release() is called when the process is exiting. It will delete all the mmu notifiers. But at this time the page belonging to the process is still present in page tables and is present on the LRU list, so this race will happen: CPU 0 CPU 1 mmu_notifier_release: try_to_unmap: hlist_del_init_rcu(&mn->hlist); ptep_clear_flush_notify: mmu nofifler not found free page !!!!!! /* * At the point, the page has been * freed, but it is still mapped in * the secondary MMU. */ mn->ops->release(mn, mm); Then the box is not stable and sometimes we can get this bug: [ 738.075923] BUG: Bad page state in process migrate-perf pfn:03bec [ 738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping: (null) index:0x8076 [ 738.075936] page flags: 0x20000000000014(referenced|dirty) The same issue is present in mmu_notifier_unregister(). We can call ->release before deleting the notifier to ensure the page has been unmapped from the secondary MMU before it is freed. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-15mm: setup pageblock_order before it's used by sparsememXishi Qiu
commit ca57df79d4f64e1a4886606af4289d40636189c5 upstream. On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as Itanium, pageblock_order is a variable with default value of 0. It's set to the right value by set_pageblock_order() in function free_area_init_core(). But pageblock_order may be used by sparse_init() before free_area_init_core() is called along path: sparse_init() ->sparse_early_usemaps_alloc_node() ->usemap_size() ->SECTION_BLOCKFLAGS_BITS ->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS) The uninitialized pageblock_size will cause memory wasting because usemap_size() returns a much bigger value then it's really needed. For example, on an Itanium platform, sparse_init() pageblock_order=0 usemap_size=24576 free_area_init_core() before pageblock_order=0, usemap_size=24576 free_area_init_core() after pageblock_order=12, usemap_size=8 That means 24K memory has been wasted for each section, so fix it by calling set_pageblock_order() from sparse_init(). Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Jiang Liu <liuj97@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Keping Chen <chenkeping@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-15mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page()Joonsoo Kim
commit dc32f63453f56d07a1073a697dcd843dd3098c09 upstream. Commit a6bc32b89922 ("mm: compaction: introduce sync-light migration for use by compaction") changed the declaration of migrate_pages() and migrate_huge_pages(). But it missed changing the argument of migrate_huge_pages() in soft_offline_huge_page(). In this case, we should call migrate_huge_pages() with MIGRATE_SYNC. Additionally, there is a mismatch between type the of argument and the function declaration for migrate_pages(). Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-15memcg: further prevent OOM with too many dirty pagesHugh Dickins
commit c3b94f44fcb0725471ecebb701c077a0ed67bd07 upstream. The may_enter_fs test turns out to be too restrictive: though I saw no problem with it when testing on 3.5-rc6, it very soon OOMed when I tested on 3.5-rc6-mm1. I don't know what the difference there is, perhaps I just slightly changed the way I started off the testing: dd if=/dev/zero of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M memory.limit_in_bytes cgroup to ext4 on USB stick. ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why the transaction needs to be started even before allocating pagecache memory. But it may not be worth worrying about these days: if direct reclaim avoids FS writeback, does __GFP_FS now mean anything? Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop device; but since that also masks off __GFP_IO, we can test for __GFP_IO directly, ignoring may_enter_fs and __GFP_FS. But even so, the test still OOMs sometimes: when originally testing on 3.5-rc6, it OOMed about one time in five or ten; when testing just now on 3.5-rc6-mm1, it OOMed on the first iteration. This residual problem comes from an accumulation of pages under ordinary writeback, not marked PageReclaim, so rightly not causing the memcg check to wait on their writeback: these too can prevent shrink_page_list() from freeing any pages, so many times that memcg reclaim fails and OOMs. Deal with these in the same way as direct reclaim now deals with dirty FS pages: mark them PageReclaim. It is appropriate to rotate these to tail of list when writepage completes, but more importantly, the PageReclaim flag makes memcg reclaim wait on them if encountered again. Increment NR_VMSCAN_IMMEDIATE? That's arguable: I chose not. Setting PageReclaim here may occasionally race with end_page_writeback() clearing it: lru_deactivate_fn() already faced the same race, and correctly concluded that the window is small and the issue non-critical. With these changes, the test runs indefinitely without OOMing on ext4, ext3 and ext2: I'll move on to test with other filesystems later. Trivia: invert conditions for a clearer block without an else, and goto keep_locked to do the unlock_page. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-15memcg: prevent OOM with too many dirty pagesMichal Hocko
commit e62e384e9da8d9a0c599795464a7e76fd490931c upstream. The current implementation of dirty pages throttling is not memcg aware which makes it easy to have memcg LRUs full of dirty pages. Without throttling, these LRUs can be scanned faster than the rate of writeback, leading to memcg OOM conditions when the hard limit is small. This patch fixes the problem by throttling the allocating process (possibly a writer) during the hard limit reclaim by waiting on PageReclaim pages. We are waiting only for PageReclaim pages because those are the pages that made one full round over LRU and that means that the writeback is much slower than scanning. The solution is far from being ideal - long term solution is memcg aware dirty throttling - but it is meant to be a band aid until we have a real fix. We are seeing this happening during nightly backups which are placed into containers to prevent from eviction of the real working set. The change affects only memcg reclaim and only when we encounter PageReclaim pages which is a signal that the reclaim doesn't catch up on with the writers so somebody should be throttled. This could be potentially unfair because it could be somebody else from the group who gets throttled on behalf of the writer but as writers need to allocate as well and they allocate in higher rate the probability that only innocent processes would be penalized is not that high. I have tested this change by a simple dd copying /dev/zero to tmpfs or ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G containers) and dd got killed by OOM killer every time. With the patch I could run the dd with the same size under 5M controller without any OOM. The issue is more visible with slower devices for output. * With the patch ================ * tmpfs size=2G --------------- $ vim cgroup_cache_oom_test.sh $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s * ext3 ------ $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s * Without the patch =================== * tmpfs size=2G --------------- $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group ./cgroup_cache_oom_test.sh: line 46: 4668 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s * ext3 ------ $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group ./cgroup_cache_oom_test.sh: line 46: 4689 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group ./cgroup_cache_oom_test.sh: line 46: 4692 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s [akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n] [hughd@google.com: fix deadlock with loop driver] Reviewed-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faultsTony Luck
commit 6751ed65dc6642af64f7b8a440a75563c8aab7ae upstream. In commit dad1743e5993f1 ("x86/mce: Only restart instruction after machine check recovery if it is safe") we fixed mce_notify_process() to force a signal to the current process if it was not restartable (RIPV bit not set in MCG_STATUS). But doing it here means that the process doesn't get told the virtual address of the fault via siginfo_t->si_addr. This would prevent application level recovery from the fault. Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so that we will provide the right information with the signal. Signed-off-by: Tony Luck <tony.luck@intel.com> Acked-by: Borislav Petkov <borislav.petkov@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-17Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge Andrew's remaining patches for 3.5: "Nine fixes" * Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (9 commits) mm: fix lost kswapd wakeup in kswapd_stop() m32r: make memset() global for CONFIG_KERNEL_BZIP2=y m32r: add memcpy() for CONFIG_KERNEL_GZIP=y m32r: consistently use "suffix-$(...)" m32r: fix 'fix breakage from "m32r: use generic ptrace_resume code"' fallout m32r: fix pull clearing RESTORE_SIGMASK into block_sigmask() fallout m32r: remove duplicate definition of PTRACE_O_TRACESYSGOOD mn10300: fix "pull clearing RESTORE_SIGMASK into block_sigmask()" fallout bootmem: make ___alloc_bootmem_node_nopanic() really nopanic
2012-07-17mm: fix lost kswapd wakeup in kswapd_stop()Aaditya Kumar
Offlining memory may block forever, waiting for kswapd() to wake up because kswapd() does not check the event kthread->should_stop before sleeping. The proper pattern, from Documentation/memory-barriers.txt, is: --- waker --- event_indicated = 1; wake_up_process(event_daemon); --- sleeper --- for (;;) { set_current_state(TASK_UNINTERRUPTIBLE); if (event_indicated) break; schedule(); } set_current_state() may be wrapped by: prepare_to_wait(); In the kswapd() case, event_indicated is kthread->should_stop. === offlining memory (waker) === kswapd_stop() kthread_stop() kthread->should_stop = 1 wake_up_process() wait_for_completion() === kswapd_try_to_sleep (sleeper) === kswapd_try_to_sleep() prepare_to_wait() . . schedule() . . finish_wait() The schedule() needs to be protected by a test of kthread->should_stop, which is wrapped by kthread_should_stop(). Reproducer: Do heavy file I/O in background. Do a memory offline/online in a tight loop Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-17bootmem: make ___alloc_bootmem_node_nopanic() really nopanicYinghai Lu
In reaction to commit 99ab7b19440a ("mm: sparse: fix usemap allocation above node descriptor section") Johannes said: | while backporting the below patch, I realised that your fix busted | f5bf18fa22f8 again. The problem was not a panicking version on | allocation failure but when the usemap size was too large such that | goal + size > limit triggers the BUG_ON in the bootmem allocator. So | we need a version that passes limit ONLY if the usemap is smaller than | the section. after checking the code, the name of ___alloc_bootmem_node_nopanic() does not reflect the fact. Make bootmem really not panic. Hope will kill bootmem sooner. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.3.x, 3.4.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-17Merge branch 'fixes-for-linus' of ↵Linus Torvalds
git://git.linaro.org/people/mszyprowski/linux-dma-mapping Pull CMA and DMA-mapping fixes from Marek Szyprowski: "Another set of minor fixups for recently merged Contiguous Memory Allocator and ARM DMA-mapping changes. Those patches fix mysterious crashes on systems with CMA and Himem enabled as well as some corner cases caused by typical off-by-one bug." * 'fixes-for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: ARM: dma-mapping: modify condition check while freeing pages mm: cma: fix condition check when setting global cma area mm: cma: don't replace lowmem pages with highmem
2012-07-11memblock: free allocated memblock_reserved_regions laterYinghai Lu
memblock_free_reserved_regions() calls memblock_free(), but memblock_free() would double reserved.regions too, so we could free the old range for reserved.regions. Also tj said there is another bug which could be related to this. | I don't think we're saving any noticeable | amount by doing this "free - give it to page allocator - reserve | again" dancing. We should just allocate regions aligned to page | boundaries and free them later when memblock is no longer in use. in that case, when DEBUG_PAGEALLOC, will get panic: memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39 BUG: unable to handle kernel paging request at ffff88102febd948 IP: [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155 PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447 RIP: 0010:[<ffffffff836a5774>] [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155 See the discussion at https://lkml.org/lkml/2012/6/13/469 So try to allocate with PAGE_SIZE alignment and free it later. Reported-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11mm: sparse: fix usemap allocation above node descriptor sectionYinghai Lu
After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint in alloc_bootmem_section"), usemap allocations may easily be placed outside the optimal section that holds the node descriptor, even if there is space available in that section. This results in unnecessary hotplug dependencies that need to have the node unplugged before the section holding the usemap. The reason is that the bootmem allocator doesn't guarantee a linear search starting from the passed allocation goal but may start out at a much higher address absent an upper limit. Fix this by trying the allocation with the limit at the section end, then retry without if that fails. This keeps the fix from f5bf18fa22f8 of not panicking if the allocation does not fit in the section, but still makes sure to try to stay within the section at first. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.3.x, 3.4.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11mm: sparse: fix section usemap placement calculationYinghai Lu
Commit 238305bb4d41 ("mm: remove sparsemem allocation details from the bootmem allocator") introduced a bug in the allocation goal calculation that put section usemaps not in the same section as the node descriptors, creating unnecessary hotplug dependencies between them: node 0 must be removed before remove section 16399 node 1 must be removed before remove section 16399 node 2 must be removed before remove section 16399 node 3 must be removed before remove section 16399 node 4 must be removed before remove section 16399 node 5 must be removed before remove section 16399 node 6 must be removed before remove section 16399 The reason is that it applies PAGE_SECTION_MASK to the physical address of the node descriptor when finding a suitable place to put the usemap, when this mask is actually intended to be used with PFNs. Because the PFN mask is wider, the target address will point beyond the wanted section holding the node descriptor and the node must be offlined before the section holding the usemap can go. Fix this by extending the mask to address width before use. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11shmem: cleanup shmem_add_to_page_cacheHugh Dickins
shmem_add_to_page_cache() has three callsites, but only one of them wants the radix_tree_preload() (an exceptional entry guarantees that the radix tree node is present in the other cases), and only that site can achieve mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a no-op in the other cases). We did it this way originally to reflect add_to_page_cache_locked(); but it's confusing now, so move the radix_tree preloading and mem_cgroup uncharging to that one caller. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11shmem: fix negative rss in memcg memory.statHugh Dickins
When adding the page_private checks before calling shmem_replace_page(), I did realize that there is a further race, but thought it too unlikely to need a hurried fix. But independently I've been chasing why a mem cgroup's memory.stat sometimes shows negative rss after all tasks have gone: I expected it to be a stats gathering bug, but actually it's shmem swapping's fault. It's an old surprise, that when you lock_page(lookup_swap_cache(swap)), the page may have been removed from swapcache before getting the lock; or it may have been freed and reused and be back in swapcache; and it can even be using the same swap location as before (page_private same). The swapoff case is already secure against this (swap cannot be reused until the whole area has been swapped off, and a new swapped on); and shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check for the expected radix_tree entry - but a little too late. By that time, we might have already decided to shmem_replace_page(): I don't know of a problem from that, but I'd feel more at ease not to do so spuriously. And we have already done mem_cgroup_cache_charge(), on perhaps the wrong mem cgroup: and this charge is not then undone on the error path, because PageSwapCache ends up preventing that. It's this last case which causes the occasional negative rss in memory.stat: the page is charged here as cache, but (sometimes) found to be anon when eventually it's uncharged - and in between, it's an undeserved charge on the wrong memcg. Fix this by adding an earlier check on the radix_tree entry: it's inelegant to descend the tree twice, but swapping is not the fast path, and a better solution would need a pair (try+commit) of memcg calls, and a rework of shmem_replace_page() to keep out of the swapcache. We can use the added shmem_confirm_swap() function to replace the find_get_page+page_cache_release we were already doing on the error path. And add a comment on that -EEXIST: it seems a peculiar errno to be using, but originates from its use in radix_tree_insert(). [It can be surprising to see positive rss left in a memcg's memory.stat after all tasks have gone, since it is supposed to count anonymous but not shmem. Aside from sharing anon pages via fork with a task in some other memcg, it often happens after swapping: because a swap page can't be freed while under writeback, nor while locked. So it's not an error, and these residual pages are easily freed once pressure demands.] Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11tmpfs: revert SEEK_DATA and SEEK_HOLEHugh Dickins
Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE"). I believe it's correct, and it's been nice to have from rc1 to rc6; but as the original commit said: I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it would be of any use to them on tmpfs. This code adds 92 lines and 752 bytes on x86_64 - is that bloat or worthwhile? Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs to the dumb generic support for v3.5. We can always reinstate it later if useful, and anyone needing it in a hurry can just get it out of git. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Josef Bacik <josef@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andreas Dilger <adilger@dilger.ca> Cc: Dave Chinner <david@fromorbit.com> Cc: Marco Stornelli <marco.stornelli@gmail.com> Cc: Jeff liu <jeff.liu@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11mm/memory_hotplug.c: release memory resources if hotadd_new_pgdat() failsWen Congyang
We should goto error to release memory resource if hotadd_new_pgdat() failed. Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki ISIMATU <isimatu.yasuaki@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11mm, thp: abort compaction if migration page cannot be charged to memcgDavid Rientjes
If page migration cannot charge the temporary page to the memcg, migrate_pages() will return -ENOMEM. This isn't considered in memory compaction however, and the loop continues to iterate over all pageblocks trying to isolate and migrate pages. If a small number of very large memcgs happen to be oom, however, these attempts will mostly be futile leading to an enormous amout of cpu consumption due to the page migration failures. This patch will short circuit and fail memory compaction if migrate_pages() returns -ENOMEM. COMPACT_PARTIAL is returned in case some migrations were successful so that the page allocator will retry. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11memory hotplug: fix invalid memory access caused by stale kswapd pointerJiang Liu
kswapd_stop() is called to destroy the kswapd work thread when all memory of a NUMA node has been offlined. But kswapd_stop() only terminates the work thread without resetting NODE_DATA(nid)->kswapd to NULL. The stale pointer will prevent kswapd_run() from creating a new work thread when adding memory to the memory-less NUMA node again. Eventually the stale pointer may cause invalid memory access. An example stack dump as below. It's reproduced with 2.6.32, but latest kernel has the same issue. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051a94>] exit_creds+0x12/0x78 PGD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/system/memory/memory391/state CPU 11 Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285 RIP: 0010:exit_creds+0x12/0x78 RSP: 0018:ffff8806044f1d78 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502 RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000 RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10 R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000 R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001 FS: 00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600) Stack: ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1 0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003 Call Trace: __put_task_struct+0x5d/0x97 kthread_stop+0x50/0x58 offline_pages+0x324/0x3da memory_block_change_state+0x179/0x1db store_mem_state+0x9e/0xbb sysfs_write_file+0xd0/0x107 vfs_write+0xad/0x169 sys_write+0x45/0x6e system_call_fastpath+0x16/0x1b Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0 RIP exit_creds+0x12/0x78 RSP <ffff8806044f1d78> CR2: 0000000000000000 [akpm@linux-foundation.org: add pglist_data.kswapd locking comments] Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-06mm: Hold a file reference in madvise_removeAndy Lutomirski
Otherwise the code races with munmap (causing a use-after-free of the vma) or with close (causing a use-after-free of the struct file). The bug was introduced by commit 90ed52ebe481 ("[PATCH] holepunch: fix mmap_sem i_mutex deadlock") Cc: Hugh Dickins <hugh@veritas.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-06mm: cma: don't replace lowmem pages with highmemRabin Vincent
The filesystem layer expects pages in the block device's mapping to not be in highmem (the mapping's gfp mask is set in bdget()), but CMA can currently replace lowmem pages with highmem pages, leading to crashes in filesystem code such as the one below: Unable to handle kernel NULL pointer dereference at virtual address 00000400 pgd = c0c98000 [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000 Internal error: Oops: 817 [#1] PREEMPT SMP ARM CPU: 0 Not tainted (3.5.0-rc5+ #80) PC is at __memzero+0x24/0x80 ... Process fsstress (pid: 323, stack limit = 0xc0cbc2f0) Backtrace: [<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98) [<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc) r4:c15337f0 [<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98) [<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac) r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0 [<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24) r6:beccdcf0 r5:00074000 r4:beccdbbc [<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30) Fix this by replacing only highmem pages with highmem. Reported-by: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
2012-07-03Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block bits from Jens Axboe: "As vacation is coming up, thought I'd better get rid of my pending changes in my for-linus branch for this iteration. It contains: - Two patches for mtip32xx. Killing a non-compliant sysfs interface and moving it to debugfs, where it belongs. - A few patches from Asias. Two legit bug fixes, and one killing an interface that is no longer in use. - A patch from Jan, making the annoying partition ioctl warning a bit less annoying, by restricting it to !CAP_SYS_RAWIO only. - Three bug fixes for drbd from Lars Ellenberg. - A fix for an old regression for umem, it hasn't really worked since the plugging scheme was changed in 3.0. - A few fixes from Tejun. - A splice fix from Eric Dumazet, fixing an issue with pipe resizing." * 'for-linus' of git://git.kernel.dk/linux-block: scsi: Silence unnecessary warnings about ioctl to partition block: Drop dead function blk_abort_queue() block: Mitigate lock unbalance caused by lock switching block: Avoid missed wakeup in request waitqueue umem: fix up unplugging splice: fix racy pipe->buffers uses drbd: fix null pointer dereference with on-congestion policy when diskless drbd: fix list corruption by failing but already aborted reads drbd: fix access of unallocated pages and kernel panic xen/blkfront: Add WARN to deal with misbehaving backends. blkcg: drop local variable @q from blkg_destroy() mtip32xx: Create debugfs entries for troubleshooting mtip32xx: Remove 'registers' and 'flags' from sysfs blkcg: fix blkg_alloc() failure path block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED block: fix return value on cfq_init() failure mtip32xx: Remove version.h header file inclusion xen/blkback: Copy id field when doing BLKIF_DISCARD.
2012-06-20mm, mempolicy: fix mbind() to do synchronous migrationDavid Rientjes
If the range passed to mbind() is not allocated on nodes set in the nodemask, it migrates the pages to respect the constraint. The final formal of migrate_pages() is a mode of type enum migrate_mode, not a boolean. do_mbind() is currently passing "true" which is the equivalent of MIGRATE_SYNC_LIGHT. This should instead be MIGRATE_SYNC for synchronous page migration. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20mm/memblock: fix overlapping allocation when doubling reserved arrayGreg Pearson
__alloc_memory_core_early() asks memblock for a range of memory then try to reserve it. If the reserved region array lacks space for the new range, memblock_double_array() is called to allocate more space for the array. If memblock is used to allocate memory for the new array it can end up using a range that overlaps with the range originally allocated in __alloc_memory_core_early(), leading to possible data corruption. With this patch memblock_double_array() now calls memblock_find_in_range() with a narrowed candidate range (in cases where the reserved.regions array is being doubled) so any memory allocated will not overlap with the original range that was being reserved. The range is narrowed by passing in the starting address and size of the previously allocated range. Then the range above the ending address is searched and if a candidate is not found, the range below the starting address is searched. Signed-off-by: Greg Pearson <greg.pearson@hp.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20mm/memory.c: fix kernel-doc warningsRandy Dunlap
Fix kernel-doc warnings in mm/memory.c: Warning(mm/memory.c:1377): No description found for parameter 'start' Warning(mm/memory.c:1377): Excess function parameter 'address' description in 'zap_page_range' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20mm: fix kernel-doc warningsWanpeng Li
Fix kernel-doc warnings such as Warning(../mm/page_cgroup.c:432): No description found for parameter 'id' Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record' Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20mm, thp: print useful information when mmap_sem is unlocked in zap_pmd_rangeDavid Rientjes
Andrea asked for addr, end, vma->vm_start, and vma->vm_end to be emitted when !rwsem_is_locked(&tlb->mm->mmap_sem). Otherwise, debugging the underlying issue is more difficult. Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20memcg: fix use_hierarchy css_is_ancestor oops regressionHugh Dickins
If use_hierarchy is set, reclaim testing soon oopses in css_is_ancestor() called from __mem_cgroup_same_or_subtree() called from page_referenced(): when processes are exiting, it's easy for mm_match_cgroup() to pass along a NULL memcg coming from a NULL mm->owner. Check for that in __mem_cgroup_same_or_subtree(). Return true or false? False because we cannot know if it was in the hierarchy, but also false because it's better not to count a reference from an exiting process. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20mm, oom: fix and cleanup oom score calculationsDavid Rientjes
The divide in p->signal->oom_score_adj * totalpages / 1000 within oom_badness() was causing an overflow of the signed long data type. This adds both the root bias and p->signal->oom_score_adj before doing the normalization which fixes the issue and also cleans up the calculation. Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-15swap: fix shmem swapping when more than 8 areasHugh Dickins
Minchan Kim reports that when a system has many swap areas, and tmpfs swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read back the page cannot locate it, and the read fails with -ENOMEM. Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry( swp_entry_to_pte()) technique for determining maximum usable swap offset, without stopping to realize that that actually depends upon the pte swap encoding shifting swap offset to the higher bits and truncating it there. Whereas our radix_tree swap encoding leaves offset in the lower bits: it's swap "type" (that is, index of swap area) that was truncated. Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header(). This does not reduce the usable size of a swap area any further, it leaves it as claimed when making the original commit: no change from 3.0 on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB per swapfile on i386 with PAE. It's not a change I would have risked five years ago, but with x86_64 supported for ten years, I believe it's appropriate now. Hmm, and what if some architecture implements its swap pte with offset encoded below type? That would equally break the maximum usable swap offset check. Happily, they all follow the same tradition of encoding offset above type, but I'll prepare a check on that for next. Reported-and-Reviewed-and-Tested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org [3.1, 3.2, 3.3, 3.4] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-15Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core updates (RCU and locking) from Ingo Molnar: "Most of the diffstat comes from the RCU slow boot regression fixes, but there's also a debuggability improvements/fixes." * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: memblock: Document memblock_is_region_{memory,reserved}() rcu: Precompute RCU_FAST_NO_HZ timer offsets rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structure rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacks rcu: RCU_FAST_NO_HZ detection of callback adoption spinlock: Indicate that a lockup is only suspected kdump: Execute kmsg_dump(KMSG_DUMP_PANIC) after smp_send_stop() panic: Make panic_on_oops configurable
2012-06-13splice: fix racy pipe->buffers usesEric Dumazet
Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered by splice_shrink_spd() called from vmsplice_to_pipe() commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes) added capability to adjust pipe->buffers. Problem is some paths don't hold pipe mutex and assume pipe->buffers doesn't change for their duration. Fix this by adding nr_pages_max field in struct splice_pipe_desc, and use it in place of pipe->buffers where appropriate. splice_shrink_spd() loses its struct pipe_inode_info argument. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Tom Herbert <therbert@google.com> Cc: stable <stable@vger.kernel.org> # 2.6.35 Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-08mm, oom: fix badness score underflowDavid Rientjes
If the privileges given to root threads (3% of allowable memory) or a negative value of /proc/pid/oom_score_adj happen to exceed the amount of rss of a thread, its badness score overflows as a result of commit a7f638f999ff ("mm, oom: normalize oom scores to oom_score_adj scale only for userspace"). Fix this by making the type signed and return 1, meaning the thread is still eligible for kill, if the value is negative. Reported-by: Dave Jones <davej@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-08memblock: Document memblock_is_region_{memory,reserved}()Stephen Boyd
At first glance one would think that memblock_is_region_memory() and memblock_is_region_reserved() would be implemented in the same way. Unfortunately they aren't and the former returns whether the region specified is a subset of a memory bank while the latter returns whether the region specified intersects with reserved memory. Document the two functions so that users aren't tempted to make the implementation the same between them and to clarify the purpose of the functions. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1337845521-32755-1-git-send-email-sboyd@codeaurora.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-07shmem: replace_page must flush_dcache and othersHugh Dickins
Commit bde05d1ccd51 ("shmem: replace page if mapping excludes its zone") is not at all likely to break for anyone, but it was an earlier version from before review feedback was incorporated. Fix that up now. * shmem_replace_page must flush_dcache_page after copy_highpage [akpm] * Expand comment on why shmem_unuse_inode needs page_swapcount [akpm] * Remove excess of VM_BUG_ONs from shmem_replace_page [wangcong] * Check page_private matches swap before calling shmem_replace_page [hughd] * shmem_replace_page allow for unexpected race in radix_tree lookup [hughd] Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Stephane Marchesin <marcheu@chromium.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Rob Clark <rob.clark@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-04Pull 'for-linus' branches of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/{signal,vfs} Pull signal and vfs compile breakage fixes from Al Viro. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: fixups for signal breakage * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: nommu: fix compilation of nommu.c
2012-06-04nommu: fix compilation of nommu.cGreg Ungerer
Compiling 3.5-rc1 for nommu targets gives: CC mm/nommu.o mm/nommu.c: In function ‘sys_mmap_pgoff’: mm/nommu.c:1489:2: error: ‘ret’ undeclared (first use in this function) mm/nommu.c:1489:2: note: each undeclared identifier is reported only once for each function it appears in It is trivially fixed by replacing 'ret' with the local variable that is already defined for the return value 'retval'. Signed-off-by: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-04Merge tag 'stable/frontswap.v16-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm Pull frontswap feature from Konrad Rzeszutek Wilk: "Frontswap provides a "transcendent memory" interface for swap pages. In some environments, dramatic performance savings may be obtained because swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. This tag provides the basic infrastructure along with some changes to the existing backends." Fix up trivial conflict in mm/Makefile due to removal of swap token code changing a line next to the new frontswap entry. This pull request came in before the merge window even opened, it got delayed to after the merge window by me just wanting to make sure it had actual users. Apparently IBM is using this on their embedded side, and Jan Beulich says that it's already made available for SLES and OpenSUSE users. Also acked by Rik van Riel, and Konrad points to other people liking it too. So in it goes. By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2) via Konrad Rzeszutek Wilk * tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm: frontswap: s/put_page/store/g s/get_page/load MAINTAINER: Add myself for the frontswap API mm: frontswap: config and doc files mm: frontswap: core frontswap functionality mm: frontswap: core swap subsystem hooks and headers mm: frontswap: add frontswap header file
2012-06-03Revert "mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks"Linus Torvalds
This reverts commit 5ceb9ce6fe9462a298bb2cd5c9f1ca6cb80a0199. That commit seems to be the cause of the mm compation list corruption issues that Dave Jones reported. The locking (or rather, absense there-of) is dubious, as is the use of the 'page' variable once it has been found to be outside the pageblock range. So revert it for now, we can re-visit this for 3.6. If we even need to: as Minchan Kim says, "The patch wasn't a bug fix and even test workload was very theoretical". Reported-and-tested-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-03mm: fix warning in __set_page_dirty_nobuffersHugh Dickins
New tmpfs use of !PageUptodate pages for fallocate() is triggering the WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers() is called from migrate_page_copy() for compaction. It is anomalous that migration should use __set_page_dirty_nobuffers() on an address_space that does not participate in dirty and writeback accounting; and this has also been observed to insert surprising dirty tags into a tmpfs radix_tree, despite tmpfs not using tags at all. We should probably give migrate_page_copy() a better way to preserve the tag and migrate accounting info, when mapping_cap_account_dirty(). But that needs some more work: so in the interim, avoid the warning by using a simple SetPageDirty on PageSwapBacked pages. Reported-and-tested-by: Dave Jones <davej@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01Merge branch 'slab/for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull slab updates from Pekka Enberg: "Mainly a bunch of SLUB fixes from Joonsoo Kim" * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: slub: use __SetPageSlab function to set PG_slab flag slub: fix a memory leak in get_partial_node() slub: remove unused argument of init_kmem_cache_node() slub: fix a possible memory leak Documentations: Fix slabinfo.c directory in vm/slub.txt slub: fix incorrect return type of get_any_partial()
2012-06-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs changes from Al Viro. "A lot of misc stuff. The obvious groups: * Miklos' atomic_open series; kills the damn abuse of ->d_revalidate() by NFS, which was the major stumbling block for all work in that area. * ripping security_file_mmap() and dealing with deadlocks in the area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in general. * ->encode_fh() switched to saner API; insane fake dentry in mm/cleancache.c gone. * assorted annotations in fs (endianness, __user) * parts of Artem's ->s_dirty work (jff2 and reiserfs parts) * ->update_time() work from Josef. * other bits and pieces all over the place. Normally it would've been in two or three pull requests, but signal.git stuff had eaten a lot of time during this cycle ;-/" Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the 'truncate_range' inode method was removed by the VM changes, the VFS update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due to sparse fix added twice, with other changes nearby). * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits) nfs: don't open in ->d_revalidate vfs: retry last component if opening stale dentry vfs: nameidata_to_filp(): don't throw away file on error vfs: nameidata_to_filp(): inline __dentry_open() vfs: do_dentry_open(): don't put filp vfs: split __dentry_open() vfs: do_last() common post lookup vfs: do_last(): add audit_inode before open vfs: do_last(): only return EISDIR for O_CREAT vfs: do_last(): check LOOKUP_DIRECTORY vfs: do_last(): make ENOENT exit RCU safe vfs: make follow_link check RCU safe vfs: do_last(): use inode variable vfs: do_last(): inline walk_component() vfs: do_last(): make exit RCU safe vfs: split do_lookup() Btrfs: move over to use ->update_time fs: introduce inode operation ->update_time reiserfs: get rid of resierfs_sync_super reiserfs: mark the superblock as dirty a bit later ...
2012-06-01fs: introduce inode operation ->update_timeJosef Bacik
Btrfs has to make sure we have space to allocate new blocks in order to modify the inode, so updating time can fail. We've gotten around this by having our own file_update_time but this is kind of a pain, and Christoph has indicated he would like to make xfs do something different with atime updates. So introduce ->update_time, where we will deal with i_version an a/m/c time updates and indicate which changes need to be made. The normal version just does what it has always done, updates the time and marks the inode dirty, and then filesystems can choose to do something different. I've gone through all of the users of file_update_time and made them check for errors with the exception of the fault code since it's complicated and I wasn't quite sure what to do there, also Jan is going to be pushing the file time updates into page_mkwrite for those who have it so that should satisfy btrfs and make it not a big deal to check the file_update_time() return code in the generic fault path. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-01unexport do_munmap()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01new helper: vm_mmap_pgoff()Al Viro
take it to mm/util.c, convert vm_mmap() to use of that one and take it to mm/util.c as well, convert both sys_mmap_pgoff() to use of vm_mmap_pgoff() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01kill do_mmap() completelyAl Viro
just pull into vm_mmap() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01switch aio and shm to do_mmap_pgoff(), make do_mmap() staticAl Viro
after all, 0 bytes and 0 pages is the same thing... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01move security_mmap_addr() to saner placeAl Viro
it really should be done by get_unmapped_area(); that cuts down on the amount of callers considerably and it's the right place for that stuff anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01take security_mmap_file() outside of ->mmap_semAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>