From d006ab31cd818f5e4dda2453fd09767063f49933 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Tue, 29 May 2012 15:06:45 -0700
Subject: mm: consider all swapped back pages in used-once logic

commit e48982734ea0500d1eba4f9d96195acc5406cad6 upstream.

Commit 645747462435 ("vmscan: detect mapped file pages used only once")
made mapped pages have another round in the inactive list because they
might be just short lived and so we could consider them again next time.
This heuristic helps to reduce pressure on the active list with
streaming IO workloads.

This patch fixes a regression introduced by this commit for heavy shmem
based workloads because, unlike anon pages, which are excluded from this
heuristic because they are usually long lived, shmem pages are handled
as regular page cache.

This doesn't work quite well, unfortunately, if the workload is mostly
backed by shmem (an in-memory database sitting on 80% of memory) with
streaming IO in the background (backup - up to 20% of memory).  The anon
inactive list is full of (dirty) shmem pages when watermarks are hit.
Shmem pages are kept in the inactive list (they are referenced) in the
first round and it is hard to reclaim anything else, so we reach lower
scanning priorities very quickly, which leads to excessive swap out.

Let's fix this by excluding all swap backed pages (they tend to be long
lived wrt. the regular page cache anyway) from the used-once heuristic
and rather activate them if they are referenced.

The customer's workload is a shmem backed database (80% of RAM) and they
are measuring transactions/s with IO in the background (20%).
Transactions touch more or less random rows in the table.  The
transaction rate fell by a factor of 3 (in the worst case) because of
commit 64574746.  This patch restores the previous numbers.
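
The decision described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct toy_page, check_before() and check_after() are hypothetical names, and the real page_check_references() has further branches (for example a second reference also activates a page) that are left out here. It shows only how switching the test from "anon" to "swap backed" changes the outcome for a referenced shmem page.

```c
#include <stdbool.h>

/* Return codes modelled on the kernel's enum; values are illustrative. */
enum page_references {
	PAGEREF_RECLAIM,
	PAGEREF_KEEP,
	PAGEREF_ACTIVATE,
};

/* Hypothetical stand-in for the few page properties the check looks at. */
struct toy_page {
	bool anon;		/* anonymous memory */
	bool swap_backed;	/* anon or shmem: goes to swap, not to a file */
	int referenced_ptes;	/* page table references found by the scan */
};

/* Old test: only anon pages skip the used-once heuristic. */
static enum page_references check_before(const struct toy_page *page)
{
	if (!page->referenced_ptes)
		return PAGEREF_RECLAIM;
	if (page->anon)
		return PAGEREF_ACTIVATE;
	/* Simplification: every other referenced page gets one more round. */
	return PAGEREF_KEEP;
}

/* New test: every swap backed page (anon and shmem) is activated. */
static enum page_references check_after(const struct toy_page *page)
{
	if (!page->referenced_ptes)
		return PAGEREF_RECLAIM;
	if (page->swap_backed)
		return PAGEREF_ACTIVATE;
	return PAGEREF_KEEP;
}
```

In this model a referenced shmem page stays on the inactive list under the old test and is activated under the new one, while anon and file pages behave the same in both.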

Signed-off-by: Michal Hocko
Acked-by: Johannes Weiner
Cc: Mel Gorman
Cc: Minchan Kim
Cc: KAMEZAWA Hiroyuki
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Ben Hutchings
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cb33d9cd4d65..fbe2d2cebed9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -697,7 +697,7 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_RECLAIM;
 
 	if (referenced_ptes) {
-		if (PageAnon(page))
+		if (PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
 		/*
 		 * All mapped pages start out with page table
-- 
cgit v1.2.3


From 73436db332d5b4dd792f115cf0b500521badf3e5 Mon Sep 17 00:00:00 2001
From: Dave Hansen
Date: Fri, 18 May 2012 11:46:30 -0700
Subject: hugetlb: fix resv_map leak in error path

commit c50ac050811d6485616a193eb0f37bfbd191cc89 upstream.

When called for anonymous (non-shared) mappings, hugetlb_reserve_pages()
does a resv_map_alloc().  It depends on code in hugetlbfs's
vm_ops->close() to release that allocation.

However, in the mmap() failure path, we do a plain unmap_region() without
the remove_vma() which actually calls vm_ops->close().

This is a decent fix.  This leak could get reintroduced if new code (say,
after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return
an error.  But, I think it would have to unroll the reservation anyway.

Christoph's test case:

	http://marc.info/?l=linux-mm&m=133728900729735

Signed-off-by: Dave Hansen
[Christoph Lameter: I have rediffed the patch against 2.6.32 and 3.2.0.]
Signed-off-by: Ben Hutchings
---
 mm/hugetlb.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

(limited to 'mm')

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7120c2e2cf82..c715bb916058 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2068,6 +2068,15 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 		kref_get(&reservations->refs);
 }
 
+static void resv_map_put(struct vm_area_struct *vma)
+{
+	struct resv_map *reservations = vma_resv_map(vma);
+
+	if (!reservations)
+		return;
+	kref_put(&reservations->refs, resv_map_release);
+}
+
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 {
 	struct hstate *h = hstate_vma(vma);
@@ -2083,7 +2092,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 		reserve = (end - start) -
 			region_count(&reservations->regions, start, end);
 
-		kref_put(&reservations->refs, resv_map_release);
+		resv_map_put(vma);
 
 		if (reserve) {
 			hugetlb_acct_memory(h, -reserve);
@@ -2884,12 +2893,16 @@ int hugetlb_reserve_pages(struct inode *inode,
 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
 	}
 
-	if (chg < 0)
-		return chg;
+	if (chg < 0) {
+		ret = chg;
+		goto out_err;
+	}
 
 	/* There must be enough filesystem quota for the mapping */
-	if (hugetlb_get_quota(inode->i_mapping, chg))
-		return -ENOSPC;
+	if (hugetlb_get_quota(inode->i_mapping, chg)) {
+		ret = -ENOSPC;
+		goto out_err;
+	}
 
 	/*
 	 * Check enough hugepages are available for the reservation.
@@ -2898,7 +2911,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	ret = hugetlb_acct_memory(h, chg);
 	if (ret < 0) {
 		hugetlb_put_quota(inode->i_mapping, chg);
-		return ret;
+		goto out_err;
 	}
 
 	/*
@@ -2915,6 +2928,9 @@ int hugetlb_reserve_pages(struct inode *inode,
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		region_add(&inode->i_mapping->private_list, from, to);
 	return 0;
+out_err:
+	resv_map_put(vma);
+	return ret;
 }
 
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
-- 
cgit v1.2.3


From 15db5b6a476408a1a4b165e1786874be5aa8e242 Mon Sep 17 00:00:00 2001
From: Minchan Kim
Date: Tue, 10 Jan 2012 15:08:39 -0800
Subject: mm/vmalloc.c: change void* into explict vm_struct*

commit db1aecafef58b5dda39c4228debe2c845e4a27ab upstream.

vmap_area->private is a void*, but the field is not used for various
purposes; it only ever holds a vm_struct.  So change it to a vm_struct*
with a matching name to improve readability and type checking.

Signed-off-by: Minchan Kim
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Ben Hutchings
---
 mm/vmalloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'mm')

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 27be2f0d4cb7..327a17dfb464 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -256,7 +256,7 @@ struct vmap_area {
 	struct rb_node rb_node;		/* address sorted rbtree */
 	struct list_head list;		/* address sorted list */
 	struct list_head purge_list;	/* "lazy purge" list */
-	void *private;
+	struct vm_struct *vm;
 	struct rcu_head rcu_head;
 };
 
@@ -1260,7 +1260,7 @@ static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
 	vm->addr = (void *)va->va_start;
 	vm->size = va->va_end - va->va_start;
 	vm->caller = caller;
-	va->private = vm;
+	va->vm = vm;
 	va->flags |= VM_VM_AREA;
 }
 
@@ -1383,7 +1383,7 @@ static struct vm_struct *find_vm_area(const void *addr)
 
 	va = find_vmap_area((unsigned long)addr);
 	if (va && va->flags & VM_VM_AREA)
-		return va->private;
+		return va->vm;
 
 	return NULL;
 }
@@ -1402,7 +1402,7 @@ struct vm_struct *remove_vm_area(const void *addr)
 
 	va = find_vmap_area((unsigned long)addr);
 	if (va && va->flags & VM_VM_AREA) {
-		struct vm_struct *vm = va->private;
+		struct vm_struct *vm = va->vm;
 
 		if (!(vm->flags & VM_UNLIST)) {
 			struct vm_struct *tmp, **p;
-- 
cgit v1.2.3


From 39083c6db7e18732e872fed0c2f2427de12a622a Mon Sep 17 00:00:00 2001
From: KyongHo
Date: Tue, 29 May 2012 15:06:49 -0700
Subject: mm: fix faulty initialization in vmalloc_init()

commit dbda591d920b4c7692725b13e3f68ecb251e9080 upstream.

The transfer of ->flags causes some of the static mapping virtual
addresses to be prematurely freed (before the mapping is removed) because
VM_LAZY_FREE gets "set" if tmp->flags has VM_IOREMAP set.  This might
cause subsequent vmalloc/ioremap calls to fail because it might allocate
one of the freed virtual address ranges that aren't unmapped.

va->flags has different types of flags from tmp->flags.  If a region with
VM_IOREMAP set is registered with vm_area_add_early(), it will be removed
by __purge_vmap_area_lazy().

Fix vmalloc_init() to correctly initialize vmap_area for the given
vm_struct.

Also initialise va->vm.  If it is not set, find_vm_area() for the early
vm regions will always fail.

Signed-off-by: KyongHo Cho
Cc: "Olav Haugan"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Ben Hutchings
---
 mm/vmalloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 327a17dfb464..eeba3bb82e06 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1160,9 +1160,10 @@ void __init vmalloc_init(void)
 	/* Import existing vmlist entries. */
 	for (tmp = vmlist; tmp; tmp = tmp->next) {
 		va = kzalloc(sizeof(struct vmap_area), GFP_NOWAIT);
-		va->flags = tmp->flags | VM_VM_AREA;
+		va->flags = VM_VM_AREA;
 		va->va_start = (unsigned long)tmp->addr;
 		va->va_end = va->va_start + tmp->size;
+		va->vm = tmp;
 		__insert_vmap_area(va);
 	}
 
-- 
cgit v1.2.3


From b7b2e9a76804a3ae4c3e9a99b8c56c48b4bc6a50 Mon Sep 17 00:00:00 2001
From: Dave Hansen
Date: Wed, 30 May 2012 07:51:07 -0700
Subject: mm: fix vma_resv_map() NULL pointer

commit 4523e1458566a0e8ecfaff90f380dd23acc44d27 upstream.

hugetlb_reserve_pages() can be used for either normal file-backed
hugetlbfs mappings, or MAP_HUGETLB.  In the MAP_HUGETLB, semi-anonymous
mode, there is not a VMA around.  The new call to resv_map_put() assumed
that there was, and resulted in a NULL pointer dereference:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
  IP: vma_resv_map+0x9/0x30
  PGD 141453067 PUD 1421e1067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP
  ...
  Pid: 14006, comm: trinity-child6 Not tainted 3.4.0+ #36
  RIP: vma_resv_map+0x9/0x30
  ...
  Process trinity-child6 (pid: 14006, threadinfo ffff8801414e0000, task ffff8801414f26b0)
  Call Trace:
    resv_map_put+0xe/0x40
    hugetlb_reserve_pages+0xa6/0x1d0
    hugetlb_file_setup+0x102/0x2c0
    newseg+0x115/0x360
    ipcget+0x1ce/0x310
    sys_shmget+0x5a/0x60
    system_call_fastpath+0x16/0x1b

This was reported by Dave Jones, but was reproducible with the
libhugetlbfs test cases, so shame on me for not running them in the first
place.

With this, the oops is gone, and the output of libhugetlbfs's
run_tests.py is identical to plain 3.4 again.
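
The failure mode and its guard can be reproduced in a few lines of user-space C. This is a sketch under stated assumptions: struct toy_vma, struct toy_resv_map, toy_resv_map_put() and toy_reserve() are hypothetical stand-ins, and a plain counter replaces the kernel's kref. It shows why a shared error label that drops a reference through the VMA has to tolerate callers that pass no VMA at all.

```c
#include <stddef.h>

/* Hypothetical stand-ins; a plain int replaces the kernel's kref. */
struct toy_resv_map {
	int refs;
};

struct toy_vma {
	struct toy_resv_map *resv;
};

/* Drops one reference.  Dereferences vma, so vma must not be NULL. */
static void toy_resv_map_put(struct toy_vma *vma)
{
	struct toy_resv_map *resv = vma->resv;

	if (!resv)
		return;
	resv->refs--;
}

/*
 * Simplified reservation: a negative chg is treated as an error code.
 * vma may be NULL, mirroring the MAP_HUGETLB / shmget path where no
 * VMA exists yet.
 */
static int toy_reserve(struct toy_vma *vma, long chg)
{
	int ret;

	if (chg < 0) {
		ret = (int)chg;
		goto out_err;
	}
	return 0;

out_err:
	if (vma)	/* without this check, a NULL vma is dereferenced */
		toy_resv_map_put(vma);
	return ret;
}
```

Calling toy_reserve(NULL, -1) returns the error without touching the VMA, and the same call with a VMA drops its reservation reference.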

[ Marked for stable, since this was introduced by commit c50ac050811d
  ("hugetlb: fix resv_map leak in error path") which was also marked for
  stable ]

Reported-by: Dave Jones
Cc: Mel Gorman
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: Andrea Arcangeli
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Ben Hutchings
---
 mm/hugetlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c715bb916058..5f5c545cdf06 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2929,7 +2929,8 @@ int hugetlb_reserve_pages(struct inode *inode,
 		region_add(&inode->i_mapping->private_list, from, to);
 	return 0;
 out_err:
-	resv_map_put(vma);
+	if (vma)
+		resv_map_put(vma);
 	return ret;
 }
 
-- 
cgit v1.2.3


From dd881278ba354640bec9f3d6c6dfaef38d2c53b9 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim
Date: Thu, 17 May 2012 00:13:02 +0900
Subject: slub: fix a memory leak in get_partial_node()

commit 02d7633fa567be7bf55a993b79d2a31b95ce2227 upstream.

In the following case,

1. acquire slab for cpu partial list
2. free object to it by remote cpu
3. page->freelist = t

a memory leak occurs.

Change acquire_slab() not to zap the freelist when it is acquiring a
slab for the cpu partial list.  I think it is a sufficient solution for
fixing the memory leak.

Below is the output of 'slabinfo -r kmalloc-256' when './perf stat -r 30
hackbench 50 process 4000 > /dev/null' is done.

***Vanilla***
Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     256  Total  :     468   Sanity Checks : Off  Total: 3833856
SlabObj:     256  Full   :     111   Redzoning     : Off  Used : 2004992
SlabSiz:    8192  Partial:     302   Poisoning     : Off  Loss : 1828864
Loss   :       0  CpuSlab:      55   Tracking      : Off  Lalig:       0
Align  :       8  Objects:      32   Tracing       : Off  Lpadd:       0

***Patched***
Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     256  Total  :     300   Sanity Checks : Off  Total: 2457600
SlabObj:     256  Full   :     204   Redzoning     : Off  Used : 2348800
SlabSiz:    8192  Partial:      33   Poisoning     : Off  Loss :  108800
Loss   :       0  CpuSlab:      63   Tracking      : Off  Lalig:       0
Align  :       8  Objects:      32   Tracing       : Off  Lpadd:       0

The Total and Loss numbers show the impact of this patch.

Acked-by: Christoph Lameter
Signed-off-by: Joonsoo Kim
Signed-off-by: Pekka Enberg
Signed-off-by: Ben Hutchings
---
 mm/slub.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

(limited to 'mm')

diff --git a/mm/slub.c b/mm/slub.c
index a99c785828c6..af47188da4d3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1506,15 +1506,19 @@ static inline void *acquire_slab(struct kmem_cache *s,
 		freelist = page->freelist;
 		counters = page->counters;
 		new.counters = counters;
-		if (mode)
+		if (mode) {
 			new.inuse = page->objects;
+			new.freelist = NULL;
+		} else {
+			new.freelist = freelist;
+		}
 
 		VM_BUG_ON(new.frozen);
 		new.frozen = 1;
 
 	} while (!__cmpxchg_double_slab(s, page,
 			freelist, counters,
-			NULL, new.counters,
+			new.freelist, new.counters,
 			"lock and freeze"));
 
 	remove_partial(n, page);
@@ -1556,7 +1560,6 @@ static void *get_partial_node(struct kmem_cache *s,
 			object = t;
 			available =  page->objects - page->inuse;
 		} else {
-			page->freelist = t;
 			available = put_cpu_partial(s, page, 0);
 		}
 		if (kmem_cache_debug(s) || available > s->cpu_partial / 2)
-- 
cgit v1.2.3
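
The three-step leak described in the slub commit message can be replayed in a single-threaded user-space model. This is only a sketch under stated assumptions: struct toy_obj, struct toy_slab and the toy_* helpers are hypothetical, the remote free is simulated by an ordinary function call, and the kernel's cmpxchg-based update is replaced by plain assignments. It shows only why restoring a saved freelist after a zap loses an object freed in between.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a slab page and its free objects. */
struct toy_obj {
	struct toy_obj *next;
};

struct toy_slab {
	struct toy_obj *freelist;
};

/* Number of objects currently reachable from the slab's freelist. */
static size_t toy_free_count(const struct toy_slab *slab)
{
	size_t n = 0;
	const struct toy_obj *obj;

	for (obj = slab->freelist; obj; obj = obj->next)
		n++;
	return n;
}

/* Models a free from another cpu: push the object onto the freelist. */
static void toy_remote_free(struct toy_slab *slab, struct toy_obj *obj)
{
	obj->next = slab->freelist;
	slab->freelist = obj;
}

/* Old behaviour: zap on acquire, restore the saved list afterwards. */
static size_t toy_old_sequence(struct toy_slab *slab, struct toy_obj *freed)
{
	struct toy_obj *t = slab->freelist;	/* 1. acquire the slab ...    */

	slab->freelist = NULL;			/*    ... and zap its freelist */
	toy_remote_free(slab, freed);		/* 2. remote cpu frees object  */
	slab->freelist = t;			/* 3. overwrite: freed is lost */
	return toy_free_count(slab);
}

/* New behaviour: the freelist is left in place, nothing to restore. */
static size_t toy_new_sequence(struct toy_slab *slab, struct toy_obj *freed)
{
	toy_remote_free(slab, freed);		/* remote free lands on the list */
	return toy_free_count(slab);
}
```

Starting from two free objects and one remote free, the old sequence leaves two reachable objects (one leaked), while the new sequence leaves all three reachable.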