From 772616b031f06e05846488b01dab46a7c832da13 Mon Sep 17 00:00:00 2001
From: Roman Gushchin
Date: Tue, 11 Aug 2020 18:30:21 -0700
Subject: mm: memcg/percpu: per-memcg percpu memory statistics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Percpu memory can represent a noticeable chunk of the total memory
consumption, especially on big machines with many CPUs. Let's track
percpu memory usage for each memcg and display it in memory.stat.

A percpu allocation is usually scattered over multiple pages (and nodes),
and can be significantly smaller than a page. So let's add a byte-sized
counter on the memcg level: MEMCG_PERCPU_B. Byte-sized vmstat infra
created for slabs can be perfectly reused for percpu case.

[guro@fb.com: v3]
  Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com

Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Acked-by: Dennis Zhou
Acked-by: Johannes Weiner
Cc: Christoph Lameter
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Pekka Enberg
Cc: Tejun Heo
Cc: Tobin C. Harding
Cc: Vlastimil Babka
Cc: Waiman Long
Cc: Bixuan Cui
Cc: Michal Koutný
Cc: Stephen Rothwell
Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
Signed-off-by: Linus Torvalds
---
 Documentation/admin-guide/cgroup-v2.rst | 4 ++++
 1 file changed, 4 insertions(+)

(limited to 'Documentation')

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index fa4018afa5a4..6be43781ec7f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1274,6 +1274,10 @@ PAGE_SIZE multiple when read back.
                 Amount of memory used for storing in-kernel data
                 structures.
 
+          percpu
+                Amount of memory used for storing per-cpu kernel
+                data structures.
+
           sock
                 Amount of memory used in network transmission buffers

-- 
cgit v1.2.3


From facdaa917c4d5a376d09d25865f5a863f906234a Mon Sep 17 00:00:00 2001
From: Nitin Gupta
Date: Tue, 11 Aug 2020 18:31:00 -0700
Subject: mm: proactive compaction
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

For some applications, we need to allocate almost all memory as
hugepages. However, on a running system, higher-order allocations can
fail if the memory is fragmented. Linux kernel currently does on-demand
compaction as we request more hugepages, but this style of compaction
incurs very high latency. Experiments with one-time full memory
compaction (followed by hugepage allocations) show that kernel is able to
restore a highly fragmented memory state to a fairly compacted memory
state within <1 sec for a 32G system. Such data suggests that a more
proactive compaction can help us allocate a large fraction of memory as
hugepages keeping allocation latencies low.

For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.

The tunable takes a value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them
to just one sysctl. Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate. The internal interpretation of this opaque
value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation. A zone's present_pages determines its weight.
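
The translation just described can be sketched in Python as follows. This
is only an illustration, not the kernel code: the names Zone,
extfrag_percent, thresholds(), node_score() and
needs_proactive_compaction() are made up for this note, the per-zone
external fragmentation is taken as a given percentage rather than
computed, and capping the high threshold at 100 is an assumption that the
text above does not state.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Zone:
    """Hypothetical stand-in for a memory zone (not a kernel structure)."""
    present_pages: int      # weight of the zone in the node score
    extfrag_percent: float  # assumed external fragmentation, 0..100


def thresholds(proactiveness: int) -> Tuple[int, int]:
    """Map the tunable to (low, high) fragmentation score thresholds."""
    if not 0 <= proactiveness <= 100:
        raise ValueError("proactiveness must be in [0, 100]")
    low = 100 - proactiveness
    # Assumption: the high threshold is capped so it never exceeds 100.
    high = min(low + 10, 100)
    return low, high


def node_score(zones: List[Zone]) -> float:
    """Weighted mean of per-zone fragmentation, weighted by present_pages."""
    total_pages = sum(z.present_pages for z in zones)
    if total_pages == 0:
        return 0.0
    weighted = sum(z.extfrag_percent * z.present_pages for z in zones)
    return weighted / total_pages


def needs_proactive_compaction(zones: List[Zone], proactiveness: int) -> bool:
    """True when the node score exceeds the high threshold."""
    _, high = thresholds(proactiveness)
    return node_score(zones) > high
```

For the default proactiveness of 20, thresholds(20) returns (80, 90),
matching the low and high values quoted in this message. A node whose
zones average out to a score of 95 would be a candidate for compaction,
while one sitting exactly at 90 would not.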

To periodically check per-node score, we reuse per-node kcompactd
threads, which are woken up every 500 milliseconds to check the same. If
a node's score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value. By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.

This patch is largely based on ideas from Michal Hocko [2]. See also the
LWN article [3].

Performance data
================

System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

    echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
    echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages, which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then
allocates as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

    percentile latency
    –––––––––– –––––––
             5    7894
            10    9496
            25   12561
            30   15295
            40   18244
            50   21229
            60   27556
            75   30147
            80   31047
            90   32859
            95   33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
762G total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

    sysctl -w vm.compaction_proactiveness=20

    percentile latency
    –––––––––– –––––––
             5       2
            10       2
            25       3
            30       3
            40       3
            50       4
            60       4
            75       4
            80       4
            90       5
            95     429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
762G total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request
the heap to be allocated with THP hugepages. We also set THP to madvise
to allow hugepage backing of this heap.

    /usr/bin/time java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

    17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

    8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15, as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta
Signed-off-by: Andrew Morton
Tested-by: Oleksandr Natalenko
Reviewed-by: Vlastimil Babka
Reviewed-by: Khalid Aziz
Reviewed-by: Oleksandr Natalenko
Cc: Vlastimil Babka
Cc: Khalid Aziz
Cc: Michal Hocko
Cc: Mel Gorman
Cc: Matthew Wilcox
Cc: Mike Kravetz
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Nitin Gupta
Cc: Oleksandr Natalenko
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds
---
 Documentation/admin-guide/sysctl/vm.rst | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

(limited to 'Documentation')

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index d997cc3c26d0..4b9d2e8e9142 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous
 blocks where possible. This can be important for example in the allocation of
 huge pages although processes will also directly compact memory as required.
 
+compaction_proactiveness
+========================
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
+
+Be careful when setting it to extreme values like 100, as that may
+cause excessive background compaction activity.
 
 compact_unevictable_allowed
 ===========================

-- 
cgit v1.2.3


From de3f32e1424ca5c751d4cd78cbed6d17ca261c24 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Tue, 11 Aug 2020 18:31:25 -0700
Subject: doc, mm: sync up oom_score_adj documentation

There are at least two notes in the oom section. The 3% discount for
root processes is gone since d46078b28889 ("mm, oom: remove 3% bonus for
CAP_SYS_ADMIN processes").

Likewise children of the selected oom victim are not sacrificed since
bbbe48029720 ("mm, oom: remove 'prefer children over parent' heuristic")

Drop both of them.

Signed-off-by: Michal Hocko
Signed-off-by: Andrew Morton
Cc: Jonathan Corbet
Cc: David Rientjes
Cc: Yafang Shao
Link: http://lkml.kernel.org/r/20200709062603.18480-1-mhocko@kernel.org
Signed-off-by: Linus Torvalds
---
 Documentation/filesystems/proc.rst | 8 --------
 1 file changed, 8 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index e024a9efffd8..eb96a440a064 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1633,9 +1633,6 @@ may allocate from based on an estimation of its current memory and swap use.
 For example, if a task is using all allowed memory, its badness score will be
 1000. If it is using half of its allowed memory, its score will be 500.
 
-There is an additional factor included in the badness score: the current memory
-and swap usage is discounted by 3% for root processes.
-
 The amount of "allowed" memory depends on the context in which the oom killer
 was called. If it is due to the memory assigned to the allocating task's cpuset
 being exhausted, the allowed memory represents the set of mems assigned to that
@@ -1671,11 +1668,6 @@ The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last value
 set by a CAP_SYS_RESOURCE process.
 To reduce the value any lower requires CAP_SYS_RESOURCE.
 
-Caveat: when a parent task is selected, the oom killer will sacrifice any first
-generation children with separate address spaces instead, if possible. This
-avoids servers and important system daemons from being killed and loses the
-minimal amount of work.
-
 3.2 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
 
-- 
cgit v1.2.3


From b1aa7c9377bdff904efa0fbde1f96e1769730d8a Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Tue, 11 Aug 2020 18:31:28 -0700
Subject: doc, mm: clarify /proc/<pid>/oom_score value range

The exported value includes oom_score_adj so the range is not [0, 1000]
as described in the previous section but rather [0, 2000]. Mention that
fact explicitly.

Signed-off-by: Michal Hocko
Signed-off-by: Andrew Morton
Cc: Jonathan Corbet
Cc: David Rientjes
Cc: Yafang Shao
Link: http://lkml.kernel.org/r/20200709062603.18480-2-mhocko@kernel.org
Signed-off-by: Linus Torvalds
---
 Documentation/filesystems/proc.rst | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'Documentation')

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index eb96a440a064..533c79e8d2cd 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1676,6 +1676,9 @@ This file can be used to check the current score used by the oom-killer for
 any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which
 process should be killed in an out-of-memory situation.
 
+Please note that the exported value includes oom_score_adj so it is
+effectively in range [0,2000].
+
 3.3 /proc/<pid>/io - Display the IO accounting fields
 -------------------------------------------------------
 
-- 
cgit v1.2.3


From 1a5bae25e3cf95c4e83a97f87a6b5280d9acbb22 Mon Sep 17 00:00:00 2001
From: Anshuman Khandual
Date: Tue, 11 Aug 2020 18:31:51 -0700
Subject: mm/vmstat: add events for THP migration without split

Add following new vmstat events which will help in validating THP
migration without split. Statistics reported through these new VM events
will help in performance debugging.

1. THP_MIGRATION_SUCCESS
2. THP_MIGRATION_FAIL
3. THP_MIGRATION_SPLIT

In addition, these new events also update normal page migration
statistics appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAIL. While
here, this updates current trace event 'mm_migrate_pages' to accommodate
now available THP statistics.

[akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/]
[ziy@nvidia.com: v2]
  Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com
[anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/]
  Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com

Signed-off-by: Anshuman Khandual
Signed-off-by: Zi Yan
Signed-off-by: Andrew Morton
Reviewed-by: Daniel Jordan
Cc: Hugh Dickins
Cc: Matthew Wilcox
Cc: Zi Yan
Cc: John Hubbard
Cc: Naoya Horiguchi
Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds
---
 Documentation/vm/page_migration.rst | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

(limited to 'Documentation')

diff --git a/Documentation/vm/page_migration.rst b/Documentation/vm/page_migration.rst
index 1d6cd7db4e43..68883ac485fa 100644
--- a/Documentation/vm/page_migration.rst
+++ b/Documentation/vm/page_migration.rst
@@ -253,5 +253,32 @@ which are function pointers of struct address_space_operations.
    PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag
    for own purpose.
 
+Monitoring Migration
+=====================
+
+The following events (counters) can be used to monitor page migration.
+
+1. PGMIGRATE_SUCCESS: Normal page migration success. Each count means that a
+   page was migrated. If the page was a non-THP page, then this counter is
+   increased by one. If the page was a THP, then this counter is increased by
+   the number of THP subpages. For example, migration of a single 2MB THP that
+   has 4KB-size base pages (subpages) will cause this counter to increase by
+   512.
+
+2. PGMIGRATE_FAIL: Normal page migration failure. Same counting rules as for
+   _SUCCESS, above: this will be increased by the number of subpages, if it was
+   a THP.
+
+3. THP_MIGRATION_SUCCESS: A THP was migrated without being split.
+
+4. THP_MIGRATION_FAIL: A THP could not be migrated nor could it be split.
+
+5. THP_MIGRATION_SPLIT: A THP was migrated, but not as such: first, the THP had
+   to be split. After splitting, a migration retry was used for its sub-pages.
+
+THP_MIGRATION_* events also update the appropriate PGMIGRATE_SUCCESS or
+PGMIGRATE_FAIL events. For example, a THP migration failure will cause both
+THP_MIGRATION_FAIL and PGMIGRATE_FAIL to increase.
+
 Christoph Lameter, May 8, 2006.
 Minchan Kim, Mar 28, 2016.

-- 
cgit v1.2.3


From f38c85f1ba6902e4e2e2bf1b84edf065a904cdeb Mon Sep 17 00:00:00 2001
From: Lepton Wu
Date: Tue, 11 Aug 2020 18:36:20 -0700
Subject: coredump: add %f for executable filename

The document reads "%e" should be "executable filename" while actually it
could be changed by things like prctl PR_SET_NAME. People who use "%e"
in core_pattern get surprised when they find out they get thread name
instead of executable filename. This is either a bug of document or a
bug of code. Since the behavior of "%e" is there for long time, it could
bring another surprise for users if we "fix" the code. So we just "fix"
the document.
And more, for users who really need the "executable filename" in
core_pattern, we introduce a new "%f" for the real executable filename.
We already have "%E" for executable path in kernel, so just reuse most of
its code for the new added "%f" format.

Signed-off-by: Lepton Wu
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200701031432.2978761-1-ytht.net@gmail.com
Signed-off-by: Linus Torvalds
---
 Documentation/admin-guide/sysctl/kernel.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'Documentation')

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 2ae9669eb22c..d4b32cc32bb7 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -164,7 +164,8 @@ core_pattern
         %s              signal number
         %t              UNIX time of dump
         %h              hostname
-        %e              executable filename (may be shortened)
+        %e              executable filename (may be shortened, could be changed by prctl etc)
+        %f              executable filename
         %E              executable path
         %c              maximum size of core file by resource limit RLIMIT_CORE
         %<OTHER>        both are dropped

-- 
cgit v1.2.3