From 50bdd430c20566b13d8bc59946184b08f5875de6 Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Tue, 18 Dec 2012 14:22:04 -0800 Subject: res_counter: return amount of charges after res_counter_uncharge() It is useful to know how many charges are still left after a call to res_counter_uncharge. While it is possible to issue a res_counter_read after uncharge, this can be racy. If we need, for instance, to take some action when the counters drop down to 0, only one of the callers should see it. This is the same semantics as the atomic variables in the kernel. Since the current return value is void, we don't need to worry about anything breaking due to this change: nobody relied on that, and only users appearing from now on will be checking this value. Signed-off-by: Glauber Costa Reviewed-by: Michal Hocko Acked-by: Kamezawa Hiroyuki Acked-by: David Rientjes Cc: Johannes Weiner Cc: Suleiman Souhlal Cc: Tejun Heo Cc: Christoph Lameter Cc: Frederic Weisbecker Cc: Greg Thelen Cc: JoonSoo Kim Cc: Mel Gorman Cc: Pekka Enberg Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/resource_counter.txt | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt index 0c4a344e78f..c4d99ed0b41 100644 --- a/Documentation/cgroups/resource_counter.txt +++ b/Documentation/cgroups/resource_counter.txt @@ -83,16 +83,17 @@ to work with it. res_counter->lock internally (it must be called with res_counter->lock held). The force parameter indicates whether we can bypass the limit. - e. void res_counter_uncharge[_locked] + e. u64 res_counter_uncharge[_locked] (struct res_counter *rc, unsigned long val) When a resource is released (freed) it should be de-accounted from the resource counter it was accounted to. This is called - "uncharging". + "uncharging". The return value of this function indicate the amount + of charges still present in the counter. The _locked routines imply that the res_counter->lock is taken. - f. void res_counter_uncharge_until + f. u64 res_counter_uncharge_until (struct res_counter *rc, struct res_counter *top, unsinged long val) -- cgit v1.2.3 From d5bdae7d59451b9d63303f7794ef32bb76ba6330 Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Tue, 18 Dec 2012 14:22:22 -0800 Subject: memcg: add documentation about the kmem controller Signed-off-by: Glauber Costa Acked-by: Kamezawa Hiroyuki Acked-by: Michal Hocko Cc: Christoph Lameter Cc: David Rientjes Cc: Frederic Weisbecker Cc: Greg Thelen Cc: Johannes Weiner Cc: JoonSoo Kim Cc: Mel Gorman Cc: Pekka Enberg Cc: Rik van Riel Cc: Suleiman Souhlal Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 59 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 58 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index a25cb3fafeb..5b5b6314377 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -71,6 +71,11 @@ Brief summary of control files. memory.oom_control # set/show oom controls. memory.numa_stat # show the number of memory usage per numa node + memory.kmem.limit_in_bytes # set/show hard limit for kernel memory + memory.kmem.usage_in_bytes # show current kernel memory allocation + memory.kmem.failcnt # show the number of kernel memory usage hits limits + memory.kmem.max_usage_in_bytes # show max kernel memory usage recorded + memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation memory.kmem.tcp.failcnt # show the number of tcp buf memory usage hits limits @@ -268,20 +273,66 @@ the amount of kernel memory used by the system. Kernel memory is fundamentally different than user memory, since it can't be swapped out, which makes it possible to DoS the system by consuming too much of this precious resource. +Kernel memory won't be accounted at all until limit on a group is set. This +allows for existing setups to continue working without disruption. The limit +cannot be set if the cgroup have children, or if there are already tasks in the +cgroup. Attempting to set the limit under those conditions will return -EBUSY. +When use_hierarchy == 1 and a group is accounted, its children will +automatically be accounted regardless of their limit value. + +After a group is first limited, it will be kept being accounted until it +is removed. The memory limitation itself, can of course be removed by writing +-1 to memory.kmem.limit_in_bytes. In this case, kmem will be accounted, but not +limited. + Kernel memory limits are not imposed for the root cgroup. Usage for the root -cgroup may or may not be accounted. +cgroup may or may not be accounted. The memory used is accumulated into +memory.kmem.usage_in_bytes, or in a separate counter when it makes sense. +(currently only for tcp). +The main "kmem" counter is fed into the main counter, so kmem charges will +also be visible from the user counter. Currently no soft limit is implemented for kernel memory. It is future work to trigger slab reclaim when those limits are reached. 2.7.1 Current Kernel Memory resources accounted +* stack pages: every process consumes some stack pages. By accounting into +kernel memory, we prevent new processes from being created when the kernel +memory usage is too high. + * sockets memory pressure: some sockets protocols have memory pressure thresholds. The Memory Controller allows them to be controlled individually per cgroup, instead of globally. * tcp memory pressure: sockets memory pressure for the tcp protocol. +2.7.3 Common use cases + +Because the "kmem" counter is fed to the main user counter, kernel memory can +never be limited completely independently of user memory. Say "U" is the user +limit, and "K" the kernel limit. There are three possible ways limits can be +set: + + U != 0, K = unlimited: + This is the standard memcg limitation mechanism already present before kmem + accounting. Kernel memory is completely ignored. + + U != 0, K < U: + Kernel memory is a subset of the user memory. This setup is useful in + deployments where the total amount of memory per-cgroup is overcommited. + Overcommiting kernel memory limits is definitely not recommended, since the + box can still run out of non-reclaimable memory. + In this case, the admin could set up K so that the sum of all groups is + never greater than the total memory, and freely set U at the cost of his + QoS. + + U != 0, K >= U: + Since kmem charges will also be fed to the user counter and reclaim will be + triggered for the cgroup for both kinds of memory. This setup gives the + admin a unified view of memory, and it is also useful for people who just + want to track kernel memory usage. + 3. User Interface 0. Configuration @@ -290,6 +341,7 @@ a. Enable CONFIG_CGROUPS b. Enable CONFIG_RESOURCE_COUNTERS c. Enable CONFIG_MEMCG d. Enable CONFIG_MEMCG_SWAP (to use swap extension) +d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) 1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) # mount -t tmpfs none /sys/fs/cgroup @@ -406,6 +458,11 @@ About use_hierarchy, see Section 6. Because rmdir() moves all pages to parent, some out-of-use page caches can be moved to the parent. If you want to avoid that, force_empty will be useful. + Also, note that when memory.kmem.limit_in_bytes is set the charges due to + kernel pages will still be seen. This is not considered a failure and the + write will still return success. In this case, it is expected that + memory.kmem.usage_in_bytes == memory.usage_in_bytes. + About use_hierarchy, see Section 6. 5.2 stat file -- cgit v1.2.3 From 92e793495597af4135d94314113bf13eafb0e663 Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Tue, 18 Dec 2012 14:23:08 -0800 Subject: kmem: add slab-specific documentation about the kmem controller Signed-off-by: Glauber Costa Cc: Christoph Lameter Cc: David Rientjes Cc: Frederic Weisbecker Cc: Greg Thelen Cc: Johannes Weiner Cc: JoonSoo Kim Cc: KAMEZAWA Hiroyuki Cc: Mel Gorman Cc: Michal Hocko Cc: Pekka Enberg Cc: Rik van Riel Cc: Suleiman Souhlal Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 5b5b6314377..8b8c28b9864 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -301,6 +301,13 @@ to trigger slab reclaim when those limits are reached. kernel memory, we prevent new processes from being created when the kernel memory usage is too high. +* slab pages: pages allocated by the SLAB or SLUB allocator are tracked. A copy +of each kmem_cache is created everytime the cache is touched by the first time +from inside the memcg. The creation is done lazily, so some objects can still be +skipped while the cache is being created. All objects in a slab page should +belong to the same memcg. This only fails to hold when a task is migrated to a +different memcg during the page allocation by the cache. + * sockets memory pressure: some sockets protocols have memory pressure thresholds. The Memory Controller allows them to be controlled individually per cgroup, instead of globally. -- cgit v1.2.3 From 5bbe1ec11fcf08af086ae214cdd4f4da0f25ca46 Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Tue, 18 Dec 2012 14:23:16 -0800 Subject: Documentation: ABI: /sys/devices/system/node/ Describe NUMA node sysfs files/attributes. Note that for the specific dates and contacts I couldn't find, I left it as default for Oct 2002 and linux-mm. Signed-off-by: Davidlohr Bueso Cc: Greg Kroah-Hartman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/stable/sysfs-devices-node | 96 ++++++++++++++++++++++++++++- 1 file changed, 95 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index 49b82cad700..ce259c13c36 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -1,7 +1,101 @@ +What: /sys/devices/system/node/possible +Date: October 2002 +Contact: Linux Memory Management list +Description: + Nodes that could be possibly become online at some point. + +What: /sys/devices/system/node/online +Date: October 2002 +Contact: Linux Memory Management list +Description: + Nodes that are online. + +What: /sys/devices/system/node/has_normal_memory +Date: October 2002 +Contact: Linux Memory Management list +Description: + Nodes that have regular memory. + +What: /sys/devices/system/node/has_cpu +Date: October 2002 +Contact: Linux Memory Management list +Description: + Nodes that have one or more CPUs. + +What: /sys/devices/system/node/has_high_memory +Date: October 2002 +Contact: Linux Memory Management list +Description: + Nodes that have regular or high memory. + Depends on CONFIG_HIGHMEM. + What: /sys/devices/system/node/nodeX Date: October 2002 Contact: Linux Memory Management list Description: When CONFIG_NUMA is enabled, this is a directory containing information on node X such as what CPUs are local to the - node. + node. Each file is detailed next. + +What: /sys/devices/system/node/nodeX/cpumap +Date: October 2002 +Contact: Linux Memory Management list +Description: + The node's cpumap. + +What: /sys/devices/system/node/nodeX/cpulist +Date: October 2002 +Contact: Linux Memory Management list +Description: + The CPUs associated to the node. + +What: /sys/devices/system/node/nodeX/meminfo +Date: October 2002 +Contact: Linux Memory Management list +Description: + Provides information about the node's distribution and memory + utilization. Similar to /proc/meminfo, see Documentation/filesystems/proc.txt + +What: /sys/devices/system/node/nodeX/numastat +Date: October 2002 +Contact: Linux Memory Management list +Description: + The node's hit/miss statistics, in units of pages. + See Documentation/numastat.txt + +What: /sys/devices/system/node/nodeX/distance +Date: October 2002 +Contact: Linux Memory Management list +Description: + Distance between the node and all the other nodes + in the system. + +What: /sys/devices/system/node/nodeX/vmstat +Date: October 2002 +Contact: Linux Memory Management list +Description: + The node's zoned virtual memory statistics. + This is a superset of numastat. + +What: /sys/devices/system/node/nodeX/compact +Date: February 2010 +Contact: Mel Gorman +Description: + When this file is written to, all memory within that node + will be compacted. When it completes, memory will be freed + into blocks which have as many contiguous pages as possible + +What: /sys/devices/system/node/nodeX/scan_unevictable_pages +Date: October 2008 +Contact: Lee Schermerhorn +Description: + When set, it triggers scanning the node's unevictable lists + and move any pages that have become evictable onto the respective + zone's inactive list. See mm/vmscan.c + +What: /sys/devices/system/node/nodeX/hugepages/hugepages-/ +Date: December 2009 +Contact: Lee Schermerhorn +Description: + The node's huge page size control/query attributes. + See Documentation/vm/hugetlbpage.txt \ No newline at end of file -- cgit v1.2.3