aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2011-06-03tmpfs: fix race between truncate and writepageHugh Dickins
commit 826267cf1e6c6899eda1325a19f1b1d15c558b20 upstream. While running fsx on tmpfs with a memhog then swapoff, swapoff was hanging (interruptibly), repeatedly failing to locate the owner of a 0xff entry in the swap_map. Although shmem_writepage() does abandon when it sees incoming page index is beyond eof, there was still a window in which shmem_truncate_range() could come in between writepage's dropping lock and updating swap_map, find the half-completed swap_map entry, and in trying to free it, leave it in a state that swap_shmem_alloc() could not correct. Arguably a bug in __swap_duplicate()'s and swap_entry_free()'s handling of the different cases, but easiest to fix by moving swap_shmem_alloc() under cover of the lock. More interesting than the bug: it's been there since 2.6.33, why could I not see it with earlier kernels? The mmotm of two weeks ago seems to have some magic for generating races, this is just one of three I found. With yesterday's git I first saw this in mainline, bisected in search of that magic, but the easy reproducibility evaporated. Oh well, fix the bug. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-03mm/page_alloc.c: prevent unending loop in __alloc_pages_slowpath()Andrew Barry
commit cfa54a0fcfc1017c6f122b6f21aaba36daa07f71 upstream. I believe I found a problem in __alloc_pages_slowpath, which allows a process to get stuck endlessly looping, even when lots of memory is available. Running an I/O and memory intensive stress-test I see a 0-order page allocation with __GFP_IO and __GFP_WAIT, running on a system with very little free memory. Right about the same time that the stress-test gets killed by the OOM-killer, the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even though most of the systems memory was freed by the oom-kill of the stress-test. The utility ends up looping from the rebalance label down through the wait_iff_congested continiously. Because order=0, __alloc_pages_direct_compact skips the call to get_page_from_freelist. Because all of the reclaimable memory on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested, then jumps back to rebalance without ever trying to get_page_from_freelist. This loop repeats infinitely. The test case is pretty pathological. Running a mix of I/O stress-tests that do a lot of fork() and consume all of the system memory, I can pretty reliably hit this on 600 nodes, in about 12 hours. 32GB/node. Signed-off-by: Andrew Barry <abarry@cray.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel<riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-03slub: Make CONFIG_DEBUG_PAGE_ALLOC work with new fastpathChristoph Lameter
commit 1393d9a1857471f816d0be1ccc1d6433a86050f6 upstream. Fastpath can do a speculative access to a page that CONFIG_DEBUG_PAGE_ALLOC may have marked as invalid to retrieve the pointer to the next free object. Use probe_kernel_read in that case in order not to cause a page fault. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-03mm: vmscan: correctly check if reclaimer should schedule during shrink_slabMinchan Kim
commit f06590bd718ed950c98828e30ef93204028f3210 upstream. It has been reported on some laptops that kswapd is consuming large amounts of CPU and not being scheduled when SLUB is enabled during large amounts of file copying. It is expected that this is due to kswapd missing every cond_resched() point because; shrink_page_list() calls cond_resched() if inactive pages were isolated which in turn may not happen if all_unreclaimable is set in shrink_zones(). If for whatver reason, all_unreclaimable is set on all zones, we can miss calling cond_resched(). balance_pgdat() only calls cond_resched if the zones are not balanced. For a high-order allocation that is balanced, it checks order-0 again. During that window, order-0 might have become unbalanced so it loops again for order-0 and returns that it was reclaiming for order-0 to kswapd(). It can then find that a caller has rewoken kswapd for a high-order and re-enters balance_pgdat() without ever calling cond_resched(). shrink_slab only calls cond_resched() if we are reclaiming slab pages. If there are a large number of direct reclaimers, the shrinker_rwsem can be contended and prevent kswapd calling cond_resched(). This patch modifies the shrink_slab() case. If the semaphore is contended, the caller will still check cond_resched(). After each successful call into a shrinker, the check for cond_resched() remains in case one shrinker is particularly slow. [mgorman@suse.de: preserve call to cond_resched after each call into shrinker] Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Tested-by: Colin King <colin.king@canonical.com> Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Chris Mason <chris.mason@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-03mm: vmscan: correct use of pgdat_balanced in sleeping_prematurelyJohannes Weiner
commit afc7e326a3f5bafc41324d7926c324414e343ee5 upstream. There are a few reports of people experiencing hangs when copying large amounts of data with kswapd using a large amount of CPU which appear to be due to recent reclaim changes. SLUB using high orders is the trigger but not the root cause as SLUB has been using high orders for a while. The root cause was bugs introduced into reclaim which are addressed by the following two patches. Patch 1 corrects logic introduced by commit 1741c877 ("mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced") to allow kswapd to go to sleep when balanced for high orders. Patch 2 notes that it is possible for kswapd to miss every cond_resched() and updates shrink_slab() so it'll at least reach that scheduling point. Chris Wood reports that these two patches in isolation are sufficient to prevent the system hanging. AFAIK, they should also resolve similar hangs experienced by James Bottomley. This patch: Johannes Weiner poined out that the logic in commit 1741c877 ("mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced") is backwards. Instead of allowing kswapd to go to sleep when balancing for high order allocations, it keeps it kswapd running uselessly. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Tested-by: Colin King <colin.king@canonical.com> Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Chris Mason <chris.mason@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-03kmemleak: Do not return a pointer to an object that kmemleak did not getCatalin Marinas
commit 52c3ce4ec5601ee383a14f1485f6bac7b278896e upstream. The kmemleak_seq_next() function tries to get an object (and increment its use count) before returning it. If it could not get the last object during list traversal (because it may have been freed), the function should return NULL rather than a pointer to such object that it did not get. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Acked-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-06-03tmpfs: fix highmem swapoff crash regressionHugh Dickins
commit e6c9366b2adb52cba64b359b3050200743c7568c upstream. Commit 778dd893ae78 ("tmpfs: fix race between umount and swapoff") forgot the new rules for strict atomic kmap nesting, causing WARNING: at arch/x86/mm/highmem_32.c:81 from __kunmap_atomic(), then BUG: unable to handle kernel paging request at fffb9000 from shmem_swp_set() when shmem_unuse_inode() is handling swapoff with highmem in use. My disgrace again. See https://bugzilla.kernel.org/show_bug.cgi?id=35352 Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-05-18memcg: fix zone congestionKAMEZAWA Hiroyuki
ZONE_CONGESTED should be a state of global memory reclaim. If not, a busy memcg sets this and give unnecessary throttoling in wait_iff_congested() against memory recalim in other contexts. This makes system performance bad. I'll think about "memcg is congested!" flag is required or not, later. But this fix is required first. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Ying Han <yinghan@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-16mm: fix kernel-doc warning in page_alloc.cRandy Dunlap
Fix new kernel-doc warning in mm/page_alloc.c: Warning(mm/page_alloc.c:2370): No description found for parameter 'nid' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-14tmpfs: fix race between swapoff and writepageHugh Dickins
Shame on me! Commit b1dea800ac39 "tmpfs: fix race between umount and writepage" fixed the advertized race, but introduced another: as even its comment makes clear, we cannot safely rely on a peek at list_empty() while holding no lock - until info->swapped is set, shmem_unuse_inode() may delete any formerly-swapped inode from the shmem_swaplist, which in this case would leave a swap area impossible to swapoff. Although I don't relish taking the mutex every time, I don't care much for the alternatives either; and at least the peek at list_empty() in shmem_evict_inode() (a hotter path since most inodes would never have been swapped) remains safe, because we already truncated the whole file. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11tmpfs: fix spurious ENOSPC when racing with unswapHugh Dickins
Testing the shmem_swaplist replacements for igrab() revealed another bug: writes to /dev/loop0 on a tmpfs file which fills its filesystem were sometimes failing with "Buffer I/O error"s. These came from ENOSPC failures of shmem_getpage(), when racing with swapoff: the same could happen when racing with another shmem_getpage(), pulling the page in from swap in between our find_lock_page() and our taking the info->lock (though not in the single-threaded loop case). This is unacceptable, and surprising that I've not noticed it before: it dates back many years, but (presumably) was made a lot easier to reproduce in 2.6.36, which sited a page preallocation in the race window. Fix it by rechecking the page cache before settling on an ENOSPC error. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11tmpfs: fix race between umount and swapoffHugh Dickins
The use of igrab() in swapoff's shmem_unuse_inode() is just as vulnerable to umount as that in shmem_writepage(). Fix this instance by extending the protection of shmem_swaplist_mutex right across shmem_unuse_inode(): while it's on the list, the inode cannot be evicted (and the filesystem cannot be unmounted) without shmem_evict_inode() taking that mutex to remove it from the list. But since shmem_writepage() might take that mutex, we should avoid making memory allocations or memcg charges while holding it: prepare them at the outer level in shmem_unuse(). When mem_cgroup_cache_charge() was originally placed, we didn't know until that point that the page from swap was actually a shmem page; but nowadays it's noted in the swap_map, so we're safe to charge upfront. For the radix_tree, do as is done in shmem_getpage(): preload upfront, but don't pin to the cpu; so we make a habit of refreshing the node pool, but might dip into GFP_NOWAIT reserves on occasion if subsequently preempted. With the allocation and charge moved out from shmem_unuse_inode(), we can also hold index map and info->lock over from finding the entry. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11tmpfs: fix race between umount and writepageHugh Dickins
Konstanin Khlebnikov reports that a dangerous race between umount and shmem_writepage can be reproduced by this script: for i in {1..300} ; do mkdir $i while true ; do mount -t tmpfs none $i dd if=/dev/zero of=$i/test bs=1M count=$(($RANDOM % 100)) umount $i done & done on a 6xCPU node with 8Gb RAM: kernel very unstable after this accident. =) Kernel log: VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds. Have a nice day... WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98() list_del corruption. prev->next should be ffff880222fdaac8, but was (null) Pid: 11222, comm: mount.tmpfs Not tainted 2.6.39-rc2+ #4 Call Trace: warn_slowpath_common+0x80/0x98 warn_slowpath_fmt+0x41/0x43 __list_del_entry+0x8d/0x98 evict+0x50/0x113 iput+0x138/0x141 ... BUG: unable to handle kernel paging request at ffffffffffffffff IP: shmem_free_blocks+0x18/0x4c Pid: 10422, comm: dd Tainted: G W 2.6.39-rc2+ #4 Call Trace: shmem_recalc_inode+0x61/0x66 shmem_writepage+0xba/0x1dc pageout+0x13c/0x24c shrink_page_list+0x28e/0x4be shrink_inactive_list+0x21f/0x382 ... shmem_writepage() calls igrab() on the inode for the page which came from page reclaim, to add it later into shmem_swaplist for swapoff operation. This igrab() can race with super-block deactivating process: shrink_inactive_list() deactivate_super() pageout() tmpfs_fs_type->kill_sb() shmem_writepage() kill_litter_super() generic_shutdown_super() evict_inodes() igrab() atomic_read(&inode->i_count) skip-inode iput() if (!list_empty(&sb->s_inodes)) printk("VFS: Busy inodes after... This igrap-iput pair was added in commit 1b1b32f2c6f6 "tmpfs: fix shmem_swaplist races" based on incorrect assumptions: igrab() protects the inode from concurrent eviction by deletion, but it does nothing to protect it from concurrent unmounting, which goes ahead despite the raised i_count. So this use of igrab() was wrong all along, but the race made much worse in 2.6.37 when commit 63997e98a3be "split invalidate_inodes()" replaced two attempts at invalidate_inodes() by a single evict_inodes(). Konstantin posted a plausible patch, raising sb->s_active too: I'm unsure whether it was correct or not; but burnt once by igrab(), I am sure that we don't want to rely more deeply upon externals here. Fix it by adding the inode to shmem_swaplist earlier, while the page lock on page in page cache still secures the inode against eviction, without artifically raising i_count. It was originally added later because shmem_unuse_inode() is liable to remove an inode from the list while it's unswapped; but we can guard against that by taking spinlock before dropping mutex. Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Hugh Dickins <hughd@google.com> Tested-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11memcg: allocate memory cgroup structures in local nodesAndi Kleen
Commit dde79e005a769 ("page_cgroup: reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM") added a regression that the memory cgroup data structures all end up in node 0 because the first attempt at allocating them would not pass in a node hint. Since the initialization runs on CPU #0 it would all end up node 0. This is a problem on large memory systems, where node 0 would lose a lot of memory. Change the alloc_pages_exact() to alloc_pages_exact_nid(). This will still fall back to other nodes if not enough memory is available. [ RED-PEN: right now it would fall back first before trying vmalloc_node. Probably not the best strategy ... But I left it like that for now. ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Reported-by: Doug Nelson Cc: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11mm: add alloc_pages_exact_nid()Andi Kleen
Add a alloc_pages_exact_nid() that allocates on a specific node. The naming is quite broken, but fixing that would need a larger renaming action. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: tweak comment] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11mm: use alloc_bootmem_node_nopanic() on really needed pathYinghai Lu
Stefan found nobootmem does not work on his system that has only 8M of RAM. This causes an early panic: BIOS-provided physical RAM map: BIOS-88: 0000000000000000 - 000000000009f000 (usable) BIOS-88: 0000000000100000 - 0000000000840000 (usable) bootconsole [earlyser0] enabled Notice: NX (Execute Disable) protection missing in CPU or disabled in BIOS! DMI not present or invalid. last_pfn = 0x840 max_arch_pfn = 0x100000 init_memory_mapping: 0000000000000000-0000000000840000 8MB LOWMEM available. mapped low ram: 0 - 00840000 low ram: 0 - 00840000 Zone PFN ranges: DMA 0x00000001 -> 0x00001000 Normal empty Movable zone start PFN for each node early_node_map[2] active PFN ranges 0: 0x00000001 -> 0x0000009f 0: 0x00000100 -> 0x00000840 BUG: Int 6: CR2 (null) EDI c034663c ESI (null) EBP c0329f38 ESP c0329ef4 EBX c0346380 EDX 00000006 ECX ffffffff EAX fffffff4 err (null) EIP c0353191 CS c0320060 flg 00010082 Stack: (null) c030c533 000007cd (null) c030c533 00000001 (null) (null) 00000003 0000083f 00000018 00000002 00000002 c0329f6c c03534d6 (null) (null) 00000100 00000840 (null) c0329f64 00000001 00001000 (null) Pid: 0, comm: swapper Not tainted 2.6.36 #5 Call Trace: [<c02e3707>] ? 0xc02e3707 [<c035e6e5>] 0xc035e6e5 [<c0353191>] ? 0xc0353191 [<c03534d6>] 0xc03534d6 [<c034f1cd>] 0xc034f1cd [<c034a824>] 0xc034a824 [<c03513cb>] ? 0xc03513cb [<c0349432>] 0xc0349432 [<c0349066>] 0xc0349066 It turns out that we should ignore the low limit of 16M. Use alloc_bootmem_node_nopanic() in this case. [akpm@linux-foundation.org: less mess] Signed-off-by: Yinghai LU <yinghai@kernel.org> Reported-by: Stefan Hellermann <stefan@the2masters.de> Tested-by: Stefan Hellermann <stefan@the2masters.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> [2.6.34+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-11mm: check PageUnevictable in lru_deactivate_fn()Minchan Kim
The lru_deactivate_fn should not move page which in on unevictable lru into inactive list. Otherwise, we can meet BUG when we use isolate_lru_pages as __isolate_lru_page could return -EINVAL. Reported-by: Ying Han <yinghan@google.com> Tested-by: Ying Han <yinghan@google.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel<riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09vm: fix vm_pgoff wrap in upward expansionHugh Dickins
Commit a626ca6a6564 ("vm: fix vm_pgoff wrap in stack expansion") fixed the case of an expanding mapping causing vm_pgoff wrapping when you had downward stack expansion. But there was another case where IA64 and PA-RISC expand mappings: upward expansion. This fixes that case too. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09Don't lock guardpage if the stack is growing upMikulas Patocka
Linux kernel excludes guard page when performing mlock on a VMA with down-growing stack. However, some architectures have up-growing stack and locking the guard page should be excluded in this case too. This patch fixes lvm2 on PA-RISC (and possibly other architectures with up-growing stack). lvm2 calculates number of used pages when locking and when unlocking and reports an internal error if the numbers mismatch. [ Patch changed fairly extensively to also fix /proc/<pid>/maps for the grows-up case, and to move things around a bit to clean it all up and share the infrstructure with the /proc bits. Tested on ia64 that has both grow-up and grow-down segments - Linus ] Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Tested-by: Tony Luck <tony.luck@gmail.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-04VM: skip the stack guard page lookup in get_user_pages only for mlockLinus Torvalds
The logic in __get_user_pages() used to skip the stack guard page lookup whenever the caller wasn't interested in seeing what the actual page was. But Michel Lespinasse points out that there are cases where we don't care about the physical page itself (so 'pages' may be NULL), but do want to make sure a page is mapped into the virtual address space. So using the existence of the "pages" array as an indication of whether to look up the guard page or not isn't actually so great, and we really should just use the FOLL_MLOCK bit. But because that bit was only set for the VM_LOCKED case (and not all vma's necessarily have it, even for mlock()), we couldn't do that originally. Fix that by moving the VM_LOCKED check deeper into the call-chain, which actually simplifies many things. Now mlock() gets simpler, and we can also check for FOLL_MLOCK in __get_user_pages() and the code ends up much more straightforward. Reported-and-reviewed-by: Michel Lespinasse <walken@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-04slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchgThomas Gleixner
The SLUB allocator use of the cmpxchg_double logic was wrong: it actually needs the irq-safe one. That happens automatically when we use the native unlocked 'cmpxchg8b' instruction, but when compiling the kernel for older x86 CPUs that do not support that instruction, we fall back to the generic emulation code. And if you don't specify that you want the irq-safe version, the generic code ends up just open-coding the cmpxchg8b equivalent without any protection against interrupts or preemption. Which definitely doesn't work for SLUB. This was reported by Werner Landgraf <w.landgraf@ru.ru>, who saw instability with his distro-kernel that was compiled to support pretty much everything under the sun. Most big Linux distributions tend to compile for PPro and later, and would never have noticed this problem. This also fixes the prototypes for the irqsafe cmpxchg_double functions to use 'bool' like they should. [ Btw, that whole "generic code defaults to no protection" design just sounds stupid - if the code needs no protection, there is no reason to use "cmpxchg_double" to begin with. So we should probably just remove the unprotected version entirely as pointless. - Linus ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reported-and-tested-by: werner <w.landgraf@ru.ru> Acked-and-tested-by: Ingo Molnar <mingo@elte.hu> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1105041539050.3005@ionos Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-28mm: check if PTE is already allocated during page faultMel Gorman
With transparent hugepage support, handle_mm_fault() has to be careful that a normal PMD has been established before handling a PTE fault. To achieve this, it used __pte_alloc() directly instead of pte_alloc_map as pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map() is called once it is known the PMD is safe. pte_alloc_map() is smart enough to check if a PTE is already present before calling __pte_alloc but this check was lost. As a consequence, PTEs may be allocated unnecessarily and the page table lock taken. Thi useless PTE does get cleaned up but it's a performance hit which is visible in page_test from aim9. This patch simply re-adds the check normally done by pte_alloc_map to check if the PTE needs to be allocated before taking the page table lock. The effect is noticable in page_test from aim9. AIM9 2.6.38-vanilla 2.6.38-checkptenone creat-clo 446.10 ( 0.00%) 424.47 (-5.10%) page_test 38.10 ( 0.00%) 42.04 ( 9.37%) brk_test 52.45 ( 0.00%) 51.57 (-1.71%) exec_test 382.00 ( 0.00%) 456.90 (16.39%) fork_test 60.11 ( 0.00%) 67.79 (11.34%) MMTests Statistics: duration Total Elapsed Time (seconds) 611.90 612.22 (While this affects 2.6.38, it is a performance rather than a functional bug and normally outside the rules -stable. While the big performance differences are to a microbench, the difference in fork and exec performance may be significant enough that -stable wants to consider the patch) Reported-by: Raz Ben Yehuda <raziebe@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-28oom: use pte pages in OOM scoreKOSAKI Motohiro
PTE pages eat up memory just like anything else, but we do not account for them in any way in the OOM scores. They are also _guaranteed_ to get freed up when a process is OOM killed, while RSS is not. Reported-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-28mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanupsAndrea Arcangeli
The huge_memory.c THP page fault was allowed to run if vm_ops was null (which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't setup a special vma->vm_ops and it would fallback to regular anonymous memory) but other THP logics weren't fully activated for vmas with vm_file not NULL (/dev/zero has a not NULL vma->vm_file). So this removes the vm_file checks so that /dev/zero also can safely use THP (the other albeit safer approach to fix this bug would have been to prevent the THP initial page fault to run if vm_file was set). After removing the vm_file checks, this also makes huge_memory.c stricter in khugepaged for the DEBUG_VM=y case. It doesn't replace the vm_file check with a is_pfn_mapping check (but it keeps checking for VM_PFNMAP under VM_BUG_ON) because for a is_cow_mapping() mapping VM_PFNMAP should only be allowed to exist before the first page fault, and in turn when vma->anon_vma is null (so preventing khugepaged registration). So I tend to think the previous comment saying if vm_file was set, VM_PFNMAP might have been set and we could still be registered in khugepaged (despite anon_vma was not NULL to be registered in khugepaged) was too paranoid. The is_linear_pfn_mapping check is also I think superfluous (as described by comment) but under DEBUG_VM it is safe to stay. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=33682 Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Caspar Zhang <bugs@casparzhang.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14mm/thp: use conventional format for boolean attributesBen Hutchings
The conventional format for boolean attributes in sysfs is numeric ("0" or "1" followed by new-line). Any boolean attribute can then be read and written using a generic function. Using the strings "yes [no]", "[yes] no" (read), "yes" and "no" (write) will frustrate this. [akpm@linux-foundation.org: use kstrtoul()] [akpm@linux-foundation.org: test_bit() doesn't return 1/0, per Neil] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Tested-by: David Rientjes <rientjes@google.com> Cc: NeilBrown <neilb@suse.de> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14oom-kill: remove boost_dying_task_prio()KOSAKI Motohiro
This is an almost-revert of commit 93b43fa ("oom: give the dying task a higher priority"). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14vmscan: all_unreclaimable() use zone->all_unreclaimable as a nameKOSAKI Motohiro
all_unreclaimable check in direct reclaim has been introduced at 2.6.19 by following commit. 2006 Sep 25; commit 408d8544; oom: use unreclaimable info And it went through strange history. firstly, following commit broke the logic unintentionally. 2008 Apr 29; commit a41f24ea; page allocator: smarter retry of costly-order allocations Two years later, I've found obvious meaningless code fragment and restored original intention by following commit. 2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages() return value when priority==0 But, the logic didn't works when 32bit highmem system goes hibernation and Minchan slightly changed the algorithm and fixed it . 2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable in direct reclaim path But, recently, Andrey Vagin found the new corner case. Look, struct zone { .. int all_unreclaimable; .. unsigned long pages_scanned; .. } zone->all_unreclaimable and zone->pages_scanned are neigher atomic variables nor protected by lock. Therefore zones can become a state of zone->page_scanned=0 and zone->all_unreclaimable=1. In this case, current all_unreclaimable() return false even though zone->all_unreclaimabe=1. This resulted in the kernel hanging up when executing a loop of the form 1. fork 2. mmap 3. touch memory 4. read memory 5. munmmap as described in http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725 Is this ignorable minor issue? No. Unfortunately, x86 has very small dma zone and it become zone->all_unreclamble=1 easily. and if it become all_unreclaimable=1, it never restore all_unreclaimable=0. Why? if all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and a-few-lru-pages>>DEF_PRIORITY always makes 0. that mean no page scan at all! Eventually, oom-killer never works on such systems. That said, we can't use zone->pages_scanned for this purpose. This patch restore all_unreclaimable() use zone->all_unreclaimable as old. and in addition, to add oom_killer_disabled check to avoid reintroduce the issue of commit d1908362 ("vmscan: check all_unreclaimable in direct reclaim path"). Reported-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14mm: check that we have the right vma in __access_remote_vm()Michael Ellerman
In __access_remote_vm() we need to check that we have found the right vma, not the following vma before we try to access it. Otherwise we might call the vma's access routine with an address which does not fall inside the vma. It was discovered on a current kernel but with an unreleased driver, from memory it was strace leading to a kernel bad access, but it obviously depends on what the access implementation does. Looking at other access implementations I only see: $ git grep -A 5 vm_operations|grep access arch/powerpc/platforms/cell/spufs/file.c- .access = spufs_mem_mmap_access, arch/x86/pci/i386.c- .access = generic_access_phys, drivers/char/mem.c- .access = generic_access_phys fs/sysfs/bin.c- .access = bin_access, The spufs one looks like it might behave badly given the wrong vma, it assumes vma->vm_file->private_data is a spu_context, and looks like it would probably blow up pretty quickly if it wasn't. generic_access_phys() only uses the vma to check vm_flags and get the mm, and then walks page tables using the address. So it should bail on the vm_flags check, or at worst let you access some other VM_IO mapping. And bin_access() just proxies to another access implementation. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14brk: COMPAT_BRK: fix detection of randomized brkJiri Kosina
5520e89 ("brk: fix min_brk lower bound computation for COMPAT_BRK") tried to get the whole logic of brk randomization for legacy (libc5-based) applications finally right. It turns out that the way to detect whether brk has actually been randomized in the end or not introduced by that patch still doesn't work for those binaries, as reported by Geert: : /sbin/init from my old m68k ramdisk exists prematurely. : : Before the patch: : : | brk(0x80005c8e) = 0x80006000 : : After the patch: : : | brk(0x80005c8e) = 0x80005c8e : : Old libc5 considers brk() to have failed if the return value is not : identical to the requested value. I don't like it, but currently see no better option than a bit flag in task_struct to catch the CONFIG_COMPAT_BRK && randomize_va_space == 2 case. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14tmpfs: fix off-by-one in max_blocks checksHugh Dickins
If you fill up a tmpfs, df was showing tmpfs 460800 - - - /tmp because of an off-by-one in the max_blocks checks. Fix it so df shows tmpfs 460800 460800 0 100% /tmp Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14mm: add VM counters for transparent hugepagesAndi Kleen
I found it difficult to make sense of transparent huge pages without having any counters for its actions. Add some counters to vmstat for allocation of transparent hugepages and fallback to smaller pages. Optional patch, but useful for development and understanding the system. Contains improvements from Andrea Arcangeli and Johannes Weiner [akpm@linux-foundation.org: coding-style fixes] [hannes@cmpxchg.org: fix vmstat_text[] entries] Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14vmstat: update comment regarding stat_thresholdChristoph Lameter
Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14mm/page_alloc.c: silence build_all_zonelists() section mismatchPaul Mundt
The memory hotplug case involves calling to build_all_zonelists() which in turns calls in to setup_zone_pageset(). The latter is marked __meminit while build_all_zonelists() itself has no particular annotation. build_all_zonelists() is only handed a non-NULL pointer in the case of memory hotplug through an existing __meminit path, so the setup_zone_pageset() reference is always safe. The options as such are either to flag build_all_zonelists() as __ref (as per __build_all_zonelists()), or to simply discard the __meminit annotation from setup_zone_pageset(). Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14mm: optimize pfn calculation in online_page()Daniel Kiper
If CONFIG_FLATMEM is enabled pfn is calculated in online_page() more than once. It is possible to optimize that and use value established at beginning of that function. Signed-off-by: Daniel Kiper <dkiper@net-space.pl> Acked-by: Dave Hansen <dave@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-13vm: fix vm_pgoff wrap in stack expansionLinus Torvalds
Commit 982134ba6261 ("mm: avoid wrapping vm_pgoff in mremap()") fixed the case of a expanding mapping causing vm_pgoff wrapping when you used mremap. But there was another case where we expand mappings hiding in plain sight: the automatic stack expansion. This fixes that case too. This one also found by Robert Święcki, using his nasty system call fuzzer tool. Good job. Reported-and-tested-by: Robert Święcki <robert@swiecki.net> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-12vm: fix mlock() on stack guard pageLinus Torvalds
Commit 53a7706d5ed8 ("mlock: do not hold mmap_sem for extended periods of time") changed mlock() to care about the exact number of pages that __get_user_pages() had brought it. Before, it would only care about errors. And that doesn't work, because we also handled one page specially in __mlock_vma_pages_range(), namely the stack guard page. So when that case was handled, the number of pages that the function returned was off by one. In particular, it could be zero, and then the caller would end up not making any progress at all. Rather than try to fix up that off-by-one error for the mlock case specially, this just moves the logic to handle the stack guard page into__get_user_pages() itself, thus making all the counts come out right automatically. Reported-by: Robert Święcki <robert@swiecki.net> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-07Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6Linus Torvalds
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings
2011-04-07mm: avoid wrapping vm_pgoff in mremap()Linus Torvalds
The normal mmap paths all avoid creating a mapping where the pgoff inside the mapping could wrap around due to overflow. However, an expanding mremap() can take such a non-wrapping mapping and make it bigger and cause a wrapping condition. Noticed by Robert Swiecki when running a system call fuzzer, where it caused a BUG_ON() due to terminally confusing the vma_prio_tree code. A vma dumping patch by Hugh then pinpointed the crazy wrapped case. Reported-and-tested-by: Robert Swiecki <robert@swiecki.net> Acked-by: Hugh Dickins <hughd@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-29Merge branch 'frv' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-frv * 'frv' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-frv: FRV: Use generic show_interrupts() FRV: Convert genirq namespace frv: Select GENERIC_HARDIRQS_NO_DEPRECATED frv: Convert cpu irq_chip to new functions frv: Convert mb93493 irq_chip to new functions frv: Convert mb93093 irq_chip to new function frv: Convert mb93091 irq_chip to new functions frv: Fix typo from __do_IRQ overhaul frv: Remove stale irq_chip.end FRV: Do some cleanups FRV: Missing node arg in alloc_thread_info_node() macro NOMMU: implement access_remote_vm NOMMU: support SMP dynamic percpu_alloc NOMMU: percpu should use is_vmalloc_addr().
2011-03-29NOMMU: implement access_remote_vmMike Frysinger
Recent vm changes brought in a new function which the core procfs code utilizes. So implement it for nommu systems too to avoid link failures. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Simon Horman <horms@verge.net.au> Tested-by: Ithamar Adema <ithamar.adema@team-embedded.nl> Acked-by: Greg Ungerer <gerg@uclinux.org>
2011-03-28NOMMU: percpu should use is_vmalloc_addr().David Howells
per_cpu_ptr_to_phys() uses VMALLOC_START and VMALLOC_END to determine if an address is in the vmalloc() region or not. This is incorrect on NOMMU as there is no real vmalloc() capability (vmalloc() is emulated by kmalloc()). The correct way to do this is to use is_vmalloc_addr(). This encapsulates the vmalloc() region test in MMU mode and just returns 0 in NOMMU mode. On FRV in NOMMU mode, the percpu compilation fails without this patch: mm/percpu.c: In function 'per_cpu_ptr_to_phys': mm/percpu.c:1011: error: 'VMALLOC_START' undeclared (first use in this function) mm/percpu.c:1011: error: (Each undeclared identifier is reported only once mm/percpu.c:1011: error: for each function it appears in.) mm/percpu.c:1012: error: 'VMALLOC_END' undeclared (first use in this function) mm/percpu.c:1018: warning: control reaches end of non-void function Signed-off-by: David Howells <dhowells@redhat.com>
2011-03-27mm: fix memory.c incorrect kernel-docRandy Dunlap
Fix mm/memory.c incorrect kernel-doc function notation: Warning(mm/memory.c:3718): Cannot understand * @access_remote_vm - access another process' address space on line 3718 - I thought it was a doc line Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-24Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: fs: simplify iget & friends fs: pull inode->i_lock up out of writeback_single_inode fs: rename inode_lock to inode_hash_lock fs: move i_wb_list out from under inode_lock fs: move i_sb_list out from under inode_lock fs: remove inode_lock from iput_final and prune_icache fs: Lock the inode LRU list separately fs: factor inode disposal fs: protect inode->i_state with inode->i_lock autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd() autofs4 - remove autofs4_lock autofs4 - fix d_manage() return on rcu-walk autofs4 - fix autofs4_expire_indirect() traversal autofs4 - fix dentry leak in autofs4_expire_direct() autofs4 - reinstate last used update on access vfs - check non-mountpoint dentry might block in __follow_mount_rcu()
2011-03-24fs: move i_wb_list out from under inode_lockDave Chinner
Protect the inode writeback list with a new global lock inode_wb_list_lock and use it to protect the list manipulations and traversals. This lock replaces the inode_lock as the inodes on the list can be validity checked while holding the inode->i_lock and hence the inode_lock is no longer needed to protect the list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-24fs: protect inode->i_state with inode->i_lockDave Chinner
Protect inode state transitions and validity checks with the inode->i_lock. This enables us to make inode state transitions independently of the inode_lock and is the first step to peeling away the inode_lock from the code. This requires that __iget() is done atomically with i_state checks during list traversals so that we don't race with another thread marking the inode I_FREEING between the state check and grabbing the reference. Also remove the unlock_new_inode() memory barrier optimisation required to avoid taking the inode_lock when clearing I_NEW. Simplify the code by simply taking the inode->i_lock around the state change and wakeup. Because the wakeup is no longer tricky, remove the wake_up_inode() function and open code the wakeup where necessary. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-24Merge branch 'slab/urgent' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: SLUB: Write to per cpu data when allocating it slub: Fix debugobjects with lockless fastpath
2011-03-24lib, arch: add filter argument to show_mem and fix private implementationsDavid Rientjes
Commit ddd588b5dd55 ("oom: suppress nodes that are not allowed from meminfo on oom kill") moved lib/show_mem.o out of lib/lib.a, which resulted in build warnings on all architectures that implement their own versions of show_mem(): lib/lib.a(show_mem.o): In function `show_mem': show_mem.c:(.text+0x1f4): multiple definition of `show_mem' arch/sparc/mm/built-in.o:(.text+0xd70): first defined here The fix is to remove __show_mem() and add its argument to show_mem() in all implementations to prevent this breakage. Architectures that implement their own show_mem() actually don't do anything with the argument yet, but they could be made to filter nodes that aren't allowed in the current context in the future just like the generic implementation. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: James Bottomley <James.Bottomley@hansenpartnership.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-24SLUB: Write to per cpu data when allocating itChristoph Lameter
It turns out that the cmpxchg16b emulation has to access vmalloced percpu memory with interrupts disabled. If the memory has never been touched before then the fault necessary to establish the mapping will not to occur and the kernel will fail on boot. Fix that by reusing the CONFIG_PREEMPT code that writes the cpu number into a field on every cpu. Writing to the per cpu area before causes the mapping to be established before we get to a cmpxchg16b emulation. Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-03-24slub: Fix debugobjects with lockless fastpathThomas Gleixner
On Thu, 24 Mar 2011, Ingo Molnar wrote: > RIP: 0010:[<ffffffff810570a9>] [<ffffffff810570a9>] get_next_timer_interrupt+0x119/0x260 That's a typical timer crash, but you were unable to debug it with debugobjects because commit d3f661d6 broke those. Cc: Christoph Lameter <cl@linux.com> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Pekka Enberg <penberg@kernel.org>