Age  Commit message  Author
2014-11-20Merge remote-tracking branch 'eas-next/eas-next' into sched/cpuidle/eas-nexteas-next-20141120sched/cpuidle/eas-nextDaniel Lezcano
Conflicts: include/linux/sched.h kernel/sched/Makefile
2014-11-20cpuidle: sysfs: Add per cpu idle state prediction statisticsDaniel Lezcano
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2014-11-20X86: add IPI tracepointsNicolas Pitre
On X86 there are already tracepoints for IRQ vectors through which IPIs are handled. However this is highly X86 specific, and the IPI signaling is not currently traced. This is an attempt at adding generic IPI tracepoints to X86. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2014-11-20sched: add ftrace for io latency trackingDaniel Lezcano
Add ftrace events for debugging purposes. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2014-11-20cpuidle: select: hack - increase rating to have this governor as defaultDaniel Lezcano
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2014-11-20sched: io_latency: Tracking via bucketsDaniel Lezcano
The io latency tracking mechanism was recently added to the energy aware scheduler kernel tree. The purpose of this framework is to provide a way to predict IO latencies, in other words to guess how long we will be sleeping while waiting for an IO. When the cpu goes idle, we know the sleep duration from the timer, but for the other wakeups we rely on statistics in the menu governor, which is part of the cpuidle framework. The io latency tracking provides additional information about the expected sleep time, which combined with the timer duration should give us a more accurate prediction.

The first step of the io latency tracking was simply a sliding average of the values, which is not really accurate as it is not immune to IO ping pong or big variations. In order to improve that, each latency is grouped into a bucket which represents a latency interval, and for each bucket a sliding average is computed. Why? Because we don't want to take all the latencies and compute statistics on them: it does not make sense, it takes a lot of memory and computation time, for a result which is mathematically impossible to resolve. It is better to use intervals to group the small variations of the latencies. For example, 186us, 123us and 134us can fall into the bucket [100 - 199].

The size of the bucket is the bucket interval and represents the resolution of the statistical model. E.g. a bucket interval of 1us leads us to do statistics on every value, with of course a bad prediction because the number of latencies is big. A big interval can give better statistics, but can also give us a misprediction as the interval is larger. Choosing the size of the bucket interval vs the idle sleep time is the tradeoff to find. With a 200us bucket interval, the measurements show we still have good predictions, fewer mispredictions and cover the idle state target residency.

The buckets are dynamically created and stored in a list. A new bucket is added at the end of the list. This list is constantly reordered depending on the number of successive hits a bucket gets: the more a bucket is successively hit, the closer it moves to the head of the list. The guessed next latency, which is a bucket (understand: it will be between e.g. 200us and 300us, with a bucket interval of 100us), is retrieved from the list. Each bucket present in the list gets a score: the more hits a bucket has, the bigger its score. *But* this is weighted by the position in the list: the first elements have more weight than the last ones. This position is dynamically changed when a bucket is hit several times.

Example with the following latencies: 10, 100, 100, 100, 100, 100, 10, 10. We will have two buckets: 0 and 1.

    10  => bucket0(1)
    100 => bucket0(1), bucket1(1)
    100 => bucket0(1), bucket1(2)
    100 => bucket0(1), bucket1(3)
    100 => bucket0(1), bucket1(4)
  * 100 => bucket1(5), bucket0(1)
    10  => bucket1(5), bucket0(2)
    10  => bucket1(5), bucket0(3)

At (*), bucket1 reached 5 successive hits and has been moved to the beginning of the list, and bucket0 became the second one.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
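A minimal standalone sketch of the bucket idea described above (the types, names and the promotion threshold are made up for illustration, not the patch's actual code; the guess is simplified to the head bucket's average):

    #include <stdlib.h>

    #define BUCKET_INTERVAL_US 200   /* resolution of the statistical model */
    #define PROMOTE_HITS 5           /* hits needed before moving to the head */

    struct io_bucket {
        unsigned int idx;            /* latency / BUCKET_INTERVAL_US */
        unsigned int hits;
        unsigned int avg_us;         /* sliding average of this bucket */
        struct io_bucket *next;
    };

    static struct io_bucket *head;

    static void bucket_account(unsigned int latency_us)
    {
        unsigned int idx = latency_us / BUCKET_INTERVAL_US;
        struct io_bucket *b = head, *prev = NULL;

        while (b && b->idx != idx) {
            prev = b;
            b = b->next;
        }

        if (!b) {                    /* new buckets are added at the tail */
            b = calloc(1, sizeof(*b));
            if (!b)
                return;
            b->idx = idx;
            if (prev)
                prev->next = b;
            else
                head = b;
        }

        b->avg_us = b->avg_us ? (b->avg_us + latency_us) / 2 : latency_us;

        /* enough hits: promote the bucket to the head of the list */
        if (++b->hits >= PROMOTE_HITS && prev) {
            prev->next = b->next;
            b->next = head;
            head = b;
        }
    }

    /* the guessed next IO latency is taken from the first (best) bucket */
    static unsigned int next_latency_guess(void)
    {
        return head ? head->avg_us : 0;
    }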
2014-11-20sched: idle: cpuidle: Add debugfs to check prediction correctnessDaniel Lezcano
Evaluating the gap between the bet on the next event waking up a cpu and the effective sleep duration is not possible today. This simple patch provides a set of files to evaluate whether the predictions are correct or not. They are split into 3 categories:
 * under estimated : the sleep duration was greater than what we were expecting, so we could have gone into a deeper idle state. That is true if the deeper idle state was not disabled and fit the exit latency requirement.
 * over estimated : the sleep duration was smaller than what we were expecting and we did not reach the break even point for the selected idle state.
 * well estimated : the expected sleep duration led to the correct idle state selection and we got optimal energy savings.
This information is exported via debugfs in the sched/idle directory. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
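As an illustration only (the helper name and arguments are assumptions, not the patch's debugfs code), the three categories could be computed roughly like this:

    enum idle_prediction {
        UNDER_ESTIMATED,   /* a deeper state would have been possible */
        OVER_ESTIMATED,    /* break even of the selected state not reached */
        WELL_ESTIMATED,    /* the selected state matched the sleep duration */
    };

    static enum idle_prediction classify(unsigned int slept_us,
                                         unsigned int target_residency_us,
                                         unsigned int next_target_residency_us)
    {
        if (slept_us < target_residency_us)
            return OVER_ESTIMATED;
        if (next_target_residency_us && slept_us >= next_target_residency_us)
            return UNDER_ESTIMATED;
        return WELL_ESTIMATED;
    }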
2014-11-20cpuidle: Add a simple select governorDaniel Lezcano
This simple governor takes into account the predictable events: the timer sleep duration and the next expected IO sleep duration. By combining both, it deduces which idle state fits best. This governor must be extended with a statistical approach to predict all the other events. The main purpose of this governor is to handle the guessed next events in a categorized way:
 1. deterministic events : timers
 2. guessed events : IOs
 3. predictable events : keystroke, incoming network packet, ...
This governor is aimed to be moved later near the scheduler, so it can inspect/inject more information and act proactively rather than reactively. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Conflicts: drivers/cpuidle/Kconfig
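A standalone sketch of the selection idea described above, with made-up types (the real governor also has to honor disabled states and other constraints):

    struct state_desc {
        unsigned int target_residency_us;
        unsigned int exit_latency_us;
    };

    static int select_state(const struct state_desc *states, int nr_states,
                            unsigned int timer_us, unsigned int io_us,
                            unsigned int latency_req_us)
    {
        /* the next predictable wakeup is the earliest of the two events */
        unsigned int sleep_us = (io_us && io_us < timer_us) ? io_us : timer_us;
        int i, best = 0;

        /* pick the deepest state whose residency and exit latency still fit */
        for (i = 0; i < nr_states; i++) {
            if (states[i].target_residency_us > sleep_us ||
                states[i].exit_latency_us > latency_req_us)
                break;
            best = i;
        }
        return best;
    }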
2014-11-20sched: idle: Add io latency information for the next eventDaniel Lezcano
As we want to improve the sleep duration estimation, the expected IO latency duration is passed to the cpuidle framework. The governors will have to deal with it if they are interested in this information. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Conflicts: drivers/cpuidle/governors/menu.c
2014-11-20cpuidle: Remove unused headers for tickDaniel Lezcano
Moving around the different functions dealing with the time made the time headers no longer necessary in cpuidle.c. Remove them. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2014-11-20sched: idle: Compute next timer event and pass it to the cpuidle frameworkDaniel Lezcano
Following the logic of the previous patch, retrieve from the idle task the expected timer sleep duration and pass it to the cpuidle framework. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2014-11-20sched: idle: cpuidle: Pass the latency req from idle.cDaniel Lezcano
As we get the latency_req from cpuidle_idle_call, just pass it to the cpuidle layer instead of duplicating the code across the governors. That has the benefit of gradually moving the different timings we want to integrate with the scheduler closer to it. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Conflicts: drivers/cpuidle/governors/menu.c
2014-11-20Checking the zero latency inside the governors does not make sense.Daniel Lezcano
If zero latency is required, we don't want to invoke any cpuidle code at all. Move the check out of the governors and do it before selecting the state, in order to fall back to the default idle function. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2014-11-20sched: add io latency frameworkDaniel Lezcano
In order to have a good prediction of when the next event will occur, the cpuidle menu governor does some statistics about the occurrences of the events waking up a cpu. For more details, refer to the comments at the top of menu.c, located in drivers/cpuidle/governors.

Part of the prediction takes into account the number of pending IOs on the cpu and, depending on this number, uses a 'magic' number to force the selection of shallow states. It makes sense and provided a good improvement in terms of system latencies for servers. Unfortunately there are some drawbacks to this approach. The first one is the empirical approach, based on measurements for a specific hardware and architecture, giving the magic 'performance multiplier' which may not fit well for different architectures as well as new hardware which evolves over time. The second one is the mix of all the wakeup sources, making it impossible to track when a task is migrated across cpus. And the last one is the lack of correctly tracking what is happening on the system.

In order to improve that, we can classify three kinds of events:
 1. totally predictable events : this is the case for the timers
 2. partially predictable events : for example, hard disk accesses with sleep times which are more or less inside a reasonable interval, the same for SSD or SD-card
 3. difficult to predict events : incoming network packet, keyboard or mouse event. These ones need a statistical approach.
At this moment, 1., 2. and 3. are all mixed in the governor's statistics.

This patchset provides a simplified version of an io latency tracking mechanism in order to separate out and improve category 2, that is the partially predictable events. As the scheduler is a good place to measure how long a task is blocked on an IO, all the code of this patchset is tied to it. The sched entity and the io latency tracking share the same design: there is a rb tree per cpu. Each time a task is blocked on an IO, it is inserted into the tree. When the IO is complete and the task is woken up, its avg latency is updated with the time spent waiting for the IO and it is removed from the tree. The next time, it will be inserted into the tree again in case of io_schedule. If there are several tasks blocked on an IO, the leftmost node of the tree is the minimal latency; in other words, it gives the next IO event. This information may need to be balanced against the number of pending IOs (the more the disk is being accessed, the slower it is).

By splitting out these three categories we can guess more accurately the next event on the system, in conjunction with the next timer and some simplified statistics from the menu governor. Furthermore, that has the benefit of taking task migration into account as the information is embedded with the task. This is a simplified version because the tracking could be greatly improved by a polynomial regression like the sched entity tracking, and the latencies should also be tracked per device, but that implies a bigger impact as the different callers of io_schedule would have to provide a dev_t parameter. A too complex patch won't help the understanding.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
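A much-simplified standalone sketch of the flow described above (a flat array stands in for the per-cpu rb tree, and all names are invented for illustration):

    #include <limits.h>

    struct io_blocked_task {
        unsigned long long block_ts_us;   /* when the task started to wait */
        unsigned int avg_latency_us;      /* sliding average of its IO waits */
    };

    #define MAX_TRACKED 64
    static struct io_blocked_task *tracked[MAX_TRACKED];
    static int nr_tracked;

    /* called when a task blocks on an IO (io_schedule-like path) */
    static void io_track_block(struct io_blocked_task *t, unsigned long long now_us)
    {
        t->block_ts_us = now_us;
        if (nr_tracked < MAX_TRACKED)
            tracked[nr_tracked++] = t;
    }

    /* called when the IO completes and the blocked task is woken up */
    static void io_track_wakeup(struct io_blocked_task *t, unsigned long long now_us)
    {
        unsigned int latency = (unsigned int)(now_us - t->block_ts_us);
        int i;

        t->avg_latency_us = t->avg_latency_us ?
                            (t->avg_latency_us + latency) / 2 : latency;

        for (i = 0; i < nr_tracked; i++) {
            if (tracked[i] == t) {            /* swap-remove from the array */
                tracked[i] = tracked[--nr_tracked];
                break;
            }
        }
    }

    /* equivalent of the leftmost rb-tree node: the earliest expected IO wakeup */
    static unsigned int io_next_event_us(unsigned long long now_us)
    {
        unsigned int next = UINT_MAX;
        int i;

        for (i = 0; i < nr_tracked; i++) {
            unsigned long long expected = tracked[i]->block_ts_us +
                                          tracked[i]->avg_latency_us;

            if (expected > now_us && expected - now_us < next)
                next = (unsigned int)(expected - now_us);
        }
        return next;
    }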
2014-11-20sched: idle: Add sched balance optionDaniel Lezcano
This patch adds a sysctl scheduler energy aware option to choose between:
 * the energy aware scheduler
 * the old scheduler code path
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2014-11-20cpuidle: Store when the cpu went to idleDaniel Lezcano
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2014-11-19sched: energy_model: simple cpu frequency scaling policyeas-next-20141118Mike Turquette
Building on top of the scale invariant capacity patches and earlier patches in this series that prepare CFS for scaling cpu frequency, this patch implements a simple, naive ondemand-like cpu frequency scaling policy that is driven by enqueue_task_fair and dequeue_task_fair. This new policy is named "energy_model" as an homage to the on-going work in that area. It is NOT an actual energy model.

This policy is implemented using the CPUfreq governor interface for two main reasons:
1) re-using the CPUfreq machine drivers without using the governor interface is hard. I do not foresee any issue continuing to use the governor interface going forward but it is worth making clear what this patch does up front.
2) using the CPUfreq interface allows us to switch between the energy_model governor and other CPUfreq governors (such as ondemand) at run-time. This is very useful for comparative testing and tuning.

A caveat to #2 above is that the weak arch function used by the governor means that only one scheduler-driven policy can be linked at a time. This limitation does not apply to "traditional" governors. I raised this in my previous capacity_ops patches[0] but as discussed at LPC14 last week, it seems desirable to pursue a single cpu frequency scaling policy at first, and try to make that work for everyone interested in using it. If that model breaks down then we can revisit the idea of dynamic selection of scheduler-driven cpu frequency scaling.

Unlike legacy CPUfreq governors, this policy does not implement its own logic loop (such as a workqueue triggered by a timer), but instead uses an event-driven design. Frequency is evaluated by entering {en,de}queue_task_fair and then a kthread is woken from run_rebalance_domains which scales cpu frequency based on the latest evaluation.

The policy implemented in this patch takes the highest cpu utilization from policy->cpus and uses that to select a frequency target based on the same 80%/20% thresholds used as defaults in ondemand. Frequency-scaled thresholds are pre-computed when energy_model inits. The frequency selection is a simple comparison of cpu utilization (as defined in Morten's latest RFC) to the threshold values. In the future this logic could be replaced with something more sophisticated that uses PELT to get a historical overview. Ideas are welcome.

Note that the pre-computed thresholds above do not take into account micro-architecture differences (SMT or big.LITTLE hardware), only frequency invariance.

Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
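A rough sketch of the threshold logic described above (the table layout, helper name and one-step ramping are assumptions for illustration, not the governor's actual code):

    struct em_freq {
        unsigned int khz;
        unsigned long up_threshold;     /* e.g. 80% of the capacity at this frequency */
        unsigned long down_threshold;   /* e.g. 20% of the capacity at this frequency */
    };

    /* thresholds are pre-computed at init time, one entry per frequency */
    static unsigned int pick_target_khz(const struct em_freq *table, int nr_freqs,
                                        int cur, unsigned long max_cpu_usage)
    {
        if (max_cpu_usage > table[cur].up_threshold && cur < nr_freqs - 1)
            cur++;                      /* usage too high: step up */
        else if (max_cpu_usage < table[cur].down_threshold && cur > 0)
            cur--;                      /* usage low: step down */

        return table[cur].khz;
    }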
2014-11-19sched: cfs: cpu frequency scaling based on task placementMike Turquette
{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as a function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy. arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.

All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_fair_task? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14, the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.

Also discussed at LPC14 is the fact that load_balance is a very interesting place to do this, as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this. Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?

Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.

Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
2014-11-19sched: cfs: cpu frequency scaling arch functionsMike Turquette
arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow the scheduler to evaluate whether the cpu frequency should change and to invoke that change from a safe context. They are weakly defined arch functions that do nothing by default. A CPUfreq governor could use these functions to implement a frequency scaling policy based on updates to per-task statistics or updates to per-cpu utilization.

As discussed at Linux Plumbers Conference 2014, the goal will be to focus on a single cpu frequency scaling policy that works for everyone. That may mean that the weak arch function definitions can be removed entirely and a single policy implements that logic for all architectures.

Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
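A minimal sketch of what such weakly-defined hooks look like (the exact signatures in the patch may differ; __attribute__((weak)) stands in for the kernel's __weak annotation):

    struct cpumask;   /* opaque here; the real code uses the kernel's cpumask */

    /* default: no scheduler-driven frequency evaluation */
    void __attribute__((weak)) arch_eval_cpu_freq(struct cpumask *cpus)
    {
        (void)cpus;
    }

    /* default: nothing to scale; a governor may provide a strong version */
    void __attribute__((weak)) arch_scale_cpu_freq(void)
    {
    }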
2014-11-19cpufreq: add per-governor private dataMike Turquette
Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Signed-off-by: Mike Turquette <mturquette@linaro.org>
2014-11-19sched: cfs: declare capacity_of & get_cpu_usage in sched.hMike Turquette
capacity_of and get_cpu_usage are useful for cpu frequency scaling policies. Share them via sched.h so that selectable cpu frequency scaling policies can make use of them. Signed-off-by: Mike Turquette <mturquette@linaro.org>
2014-11-19sched: Make energy awareness a sched featureMorten Rasmussen
This patch introduces the ENERGY_AWARE sched feature, which is implemented using jump labels when SCHED_DEBUG is defined. It is statically set false when SCHED_DEBUG is not defined. Hence this doesn't allow energy awareness to be enabled without SCHED_DEBUG. This sched_feature knob will be replaced later with a more appropriate control knob when things have matured a bit. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Mike Turquette <mturquette@linaro.org> [mturquette@linaro.org: moved energy_aware above enqueue_task_fair]
2014-11-05Merge branch 'sched-for-mike' of git://git.linaro.org/people/vincent.guittot/kernel into eas-nextMichael Turquette
2014-11-04sched: completion: document when to use wait_for_completion_io_*Wolfram Sang
As discussed [1], accounting IO is meant for blkio only. Document that so driver authors won't use them for device io. [1] http://thread.gmane.org/gmane.linux.drivers.i2c/20470 Cc: Ingo Molnar <mingo@redhat.com> Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1415098901-2768-1-git-send-email-wsa@the-dreams.de
2014-11-04sched: updated comments of CLONE_NEWUTS and CLONE_NEWIPCChen Hanxiao
Remove question mark: s/New utsname group?/New utsname namespace Unified style for IPC: s/New ipcs/New ipc namespace Cc: Jiri Kosina <trivial@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1415091082-15093-1-git-send-email-chenhanxiao@cn.fujitsu.com
2014-11-04sched: Refactor task_struct to use numa_faults instead of numa_* pointersIulia Manda
This patch simplifies task_struct by removing the four numa_* pointers in the same array and replacing them with the array pointer. By doing this, on x86_64, the size of task_struct is reduced by 3 ulong pointers (24 bytes on x86_64). A new parameter is added to the task_faults_idx function so that it can return an index to the correct offset, corresponding with the old precalculated pointers. All of the code in sched/ that depended on task_faults_idx and numa_* was changed in order to match the new logic. Cc: mgorman@suse.de Cc: dave@stgolabs.net Cc: riel@redhat.com Cc: mingo@redhat.com Signed-off-by: Iulia Manda <iulia.manda21@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
2014-11-04sched/deadline: don't check CONFIG_SMP in switched_from_dlWanpeng Li
There are both UP and SMP versions of pull_dl_task(), so there is no need to check CONFIG_SMP in switched_from_dl(). Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1414708776-124078-6-git-send-email-wanpeng.li@linux.intel.com
2014-11-04sched/deadline: reschedule from switched_from_dl() after a successful pullWanpeng Li
In switched_from_dl() we have to issue a resched if we successfully pulled some task from other cpus. This patch also aligns the behavior with -rt. Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Ingo Molnar <mingo@redhat.com> Suggested-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1414708776-124078-5-git-send-email-wanpeng.li@linux.intel.com
2014-11-04sched/deadline: push task away if the deadline is equal to curr during wakeupWanpeng Li
This patch pushes the task away if its deadline is equal to current's during wakeup, the same behavior as the rt class. Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1414708776-124078-4-git-send-email-wanpeng.li@linux.intel.com
2014-11-04sched/deadline: add deadline rq status printWanpeng Li
This patch adds deadline rq status printing. Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng.li@linux.intel.com
2014-11-04sched/deadline: fix artificial overrun introduced by yield_task_dlWanpeng Li
The yield semantic of the deadline class is to reduce the remaining runtime to zero, and then update_curr_dl() will stop it. However, the consumed bandwidth is subtracted from the budget of the yielding task again even if it has already been set to zero, which leads to an artificial overrun. This patch fixes it by making sure we don't steal some more time from the task that yielded in update_curr_dl(). Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Ingo Molnar <mingo@redhat.com> Suggested-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1414708776-124078-2-git-send-email-wanpeng.li@linux.intel.com
2014-11-04sched/rt: cleanup check preempt equal prioWanpeng Li
This patch checks in advance whether current can be pushed/pulled somewhere else, to make the logic clear, the same behavior as the dl class.
 - If current can't be migrated, it is useless to reschedule; let's hope the task can move out.
 - If the task is migratable, let's not schedule it and see if it can be pushed or pulled somewhere else.
Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1414708776-124078-1-git-send-email-wanpeng.li@linux.intel.com
2014-11-04sched/core: dl_bw_of() has to be used under rcu_read_lock_sched()Juri Lelli
As per commit f10e00f4bf36 ("sched/dl: Use dl_bw_of() under rcu_read_lock_sched()"), dl_bw_of() has to be protected by rcu_read_lock_sched(). Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1414497286-28824-1-git-send-email-juri.lelli@arm.com
2014-11-04sched: check if we got a shallowest_idle_cpu before searching for least_loaded_cpuYao Dongdong
An idle cpu is idler than a non-idle cpu, so we needn't search for the least_loaded_cpu after we have found an idle cpu. Cc: <peterz@infradead.org> Cc: <mingo@redhat.com> Signed-off-by: Yao Dongdong <yaodongdong@huawei.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1414469286-6023-1-git-send-email-yaodongdong@huawei.com
2014-11-04sched/dl: Implement cancel_dl_timer() to use in switched_from_dl()Kirill Tkhai
The currently used hrtimer_try_to_cancel() is racy:

raw_spin_lock(&rq->lock) ... dl_task_timer raw_spin_lock(&rq->lock) ... raw_spin_lock(&rq->lock) ... switched_from_dl() ... ... hrtimer_try_to_cancel() ... ... switched_to_fair() ... ... ... ... ... ... ... ... raw_spin_unlock(&rq->lock) ... (acquired) ... ... ... ... ... ... do_exit() ... ... schedule() ... ... raw_spin_lock(&rq->lock) ... raw_spin_unlock(&rq->lock) ... ... ... raw_spin_unlock(&rq->lock) ... raw_spin_lock(&rq->lock) ... ... (acquired) put_task_struct() ... ... free_task_struct() ... ... ... ... raw_spin_unlock(&rq->lock) ... (acquired) ... ... ... ... ... (use after free) ...

So, let's implement a 100% guaranteed way to cancel the timer and be sure we are safe even in very unlikely situations.

rq unlocking does not limit the area of switched_from_dl() use, because this has already been possible in pull_dl_task() below. Let's consider the safety of this unlocking. The new code in the patch works when hrtimer_try_to_cancel() fails. This means the callback is running. In this case hrtimer_cancel() is just waiting till the callback is finished. Two cases are possible:

1) Since we are in switched_from_dl(), the new class is not dl_sched_class and the new prio is not less than MAX_DL_PRIO. So, the callback returns early; it's right after the !dl_task() check. After that hrtimer_cancel() returns back too. The above is:

    raw_spin_lock(rq->lock);           ...
    ...                                dl_task_timer()
    ...                                   raw_spin_lock(rq->lock);
    switched_from_dl()                 ...
      hrtimer_try_to_cancel()          ...
    raw_spin_unlock(rq->lock);         ...
    hrtimer_cancel()                   ...
    ...                                raw_spin_unlock(rq->lock);
    ...                                return HRTIMER_NORESTART;
    ...                                ...
    raw_spin_lock(rq->lock);           ...

2) But the below is also possible:

                                       dl_task_timer()
                                          raw_spin_lock(rq->lock);
    ...                                   raw_spin_unlock(rq->lock);
    raw_spin_lock(rq->lock);           ...
    switched_from_dl()                 ...
      hrtimer_try_to_cancel()          ...
    ...                                return HRTIMER_NORESTART;
    raw_spin_unlock(rq->lock);         ...
    hrtimer_cancel();                  ...
    raw_spin_lock(rq->lock);           ...

In this case hrtimer_cancel() returns immediately. Very unlikely case, just to mention.

Nobody can manipulate the task, because check_class_changed() is always called with pi_lock locked. Nobody can force the task to participate in (concurrent) priority inheritance schemes (the same reason). All concurrent task operations require pi_lock, which is held by us. No deadlocks with dl_task_timer() are possible, because it returns right after the !dl_task() check (it does nothing).

If we receive a new dl_task during the time of unlocked rq, we just don't have to do pull_dl_task() in switched_from_dl() further.

[peterz: Added comments] Acked-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai
2014-11-04sched: Use WARN_ONCE for the might_sleep() TASK_RUNNING testPeter Zijlstra
In some cases this can trigger a true flood of output. Requested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2014-11-04netdev: Fix sleeping inside wait eventPeter Zijlstra
rtnl_lock_unregistering*() take rtnl_lock() -- a mutex -- inside a wait loop. The wait loop relies on current->state to function, but so does mutex_lock(), nesting them makes for the inner to destroy the outer state. Fix this using the new wait_woken() bits. Cc: Oleg Nesterov <oleg@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Acked-by: David S. Miller <davem@davemloft.net> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20141029173110.GE15602@worktop.programming.kicks-ass.net
2014-11-04rfcomm: Fix broken wait constructPeter Zijlstra
rfcomm_run() is a tad broken in that it has a nested wait loop. One cannot rely on p->state for the outer wait because the inner wait will overwrite it. Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2014-11-04audit,wait: Fixup kauditd_thread wait loopPeter Zijlstra
The kauditd_thread wait loop is a bit iffy; it has a number of problems: - calls try_to_freeze() before schedule(); you typically want the thread to re-evaluate the sleep condition when unfreezing, also freeze_task() issues a wakeup. - it unconditionally does the {add,remove}_wait_queue(), even when the sleep condition is false. Use wait_event_freezable() that does the right thing. Cc: mingo@kernel.org Cc: oleg@redhat.com Cc: torvalds@linux-foundation.org Cc: tglx@linutronix.de Cc: Eric Paris <eparis@redhat.com> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20141002102251.GA6324@worktop.programming.kicks-ass.net
2014-11-04wait: Remove wait_event_freezekillable()Peter Zijlstra (Intel)
There is no user.. make it go away. Cc: oleg@redhat.com Cc: Rafael Wysocki <rjw@rjwysocki.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2014-11-04wait: Reimplement wait_event_freezable()Peter Zijlstra
Provide better implementations of wait_event_freezable() APIs. The problem is with freezer_do_not_count(), it hides the thread from the freezer, even though this thread might not actually freeze/sleep at all. Cc: oleg@redhat.com Cc: Rafael Wysocki <rjw@rjwysocki.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-d86fz1jmso9wjxa8jfpinp8o@git.kernel.org
2014-11-04sched,wait: Fix a kthread race with wait_woken()Peter Zijlstra
There is a race between kthread_stop() and the new wait_woken() that can result in a lack of progress.

    CPU 0                                    | CPU 1
                                             |
    rfcomm_run()                             | kthread_stop()
      ...                                    |
      if (!test_bit(KTHREAD_SHOULD_STOP))    |
                                             |   set_bit(KTHREAD_SHOULD_STOP)
                                             |   wake_up_process()
        wait_woken()                         |   wait_for_completion()
          set_current_state(INTERRUPTIBLE)   |
          if (!WQ_FLAG_WOKEN)                |
            schedule_timeout()               |
                                             |

After which both tasks will wait.. forever.

Fix this by having wait_woken() check for kthread_should_stop() but only for kthreads (obviously). Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2014-11-03sched: move cfs task on a CPU with higher capacityVincent Guittot
When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining capacity for CFS tasks can be significantly reduced. Once we detect such a situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle load balance to check if it's worth moving its tasks to an idle CPU.

Once the idle load balance has selected the busiest CPU, it will look for an active load balance in only two cases:
 - there is only 1 task on the busiest CPU.
 - we haven't been able to move a task from the busiest rq.
A CPU with reduced capacity is covered by the 1st case, and it's worth actively migrating its task if the idle CPU has got full capacity. This test has been added in need_active_balance.

As a sidenote, this will not generate more spurious ilb because we already trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trigger the ilb once to migrate the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new test. env.src_cpu and env.src_rq must be set unconditionally because they are used in need_active_balance, which is called even if busiest->nr_running equals 1.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
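An illustrative-only version of the extra test described above (the margins and the helper name are assumptions, not the actual need_active_balance() change):

    /* Is the busiest cpu's capacity noticeably reduced while the destination
     * still has (nearly) full capacity and the source runs a single task? */
    static int reduced_capacity_case(unsigned int src_nr_running,
                                     unsigned long src_capacity,
                                     unsigned long src_capacity_orig,
                                     unsigned long dst_capacity,
                                     unsigned long dst_capacity_orig)
    {
        return src_nr_running == 1 &&
               src_capacity * 100 < src_capacity_orig * 80 &&   /* assumed margin */
               dst_capacity * 100 > dst_capacity_orig * 95;     /* assumed margin */
    }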
2014-11-03sched: add SD_PREFER_SIBLING for SMT levelVincent Guittot
Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that the scheduler will put at least 1 task per core. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
2014-11-03sched: replace capacity_factor by usageVincent Guittot
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is SCHED_CAPACITY_SCALE. group_capacity_factor divides the capacity of the group by SCHED_LOAD_SCALE to estimate how many tasks can run in the group. Then, it compares this value with the sum of nr_running to decide whether the group is overloaded or not. But the group_capacity_factor hardly works for SMT systems: it sometimes works for big cores but fails to do the right thing for little cores.

Below are two examples to illustrate the problem that this patch solves:

1 - If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUs will have a max capacity_factor of 2 (div_round_closest(3x640/1024) = 2), which means that it will be seen as overloaded even if we have only one task per CPU.

2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4 (at max, and thanks to the fix [0] for SMT systems that prevents the apparition of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is reduced to nearly nothing), the capacity factor of the group will still be 4 (div_round_closest(3*1512/1024) = 5, which is capped to 4 with [0]).

So, this patch tries to solve this issue by removing capacity_factor and replacing it with the 2 following metrics:
 - the available CPU capacity for CFS tasks, which is already used by load_balance.
 - the usage of the CPU by the CFS tasks.
For the latter, utilization_avg_contrib has been re-introduced to compute the usage of a CPU by CFS tasks.

group_capacity_factor and group_has_free_capacity have been removed and replaced by group_no_capacity. We compare the number of tasks with the number of CPUs and we evaluate the level of utilization of the CPUs to define whether a group is overloaded or whether a group has capacity to handle more tasks.

For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task so it will be selected in priority (among the overloaded groups). Since [1], SD_PREFER_SIBLING is no longer involved in the computation of load_above_capacity because local is not overloaded.

Finally, the sched_group->sched_group_capacity->capacity_orig has been removed because it's no longer used during load balance.

[1] https://lkml.org/lkml/2014/8/12/295

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
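A sketch of the two tests described above (made-up helper names; the margin value is a placeholder, not the patch's actual tuning):

    /* more tasks than CPUs and the CFS usage (almost) fills the capacity */
    static int sketch_group_is_overloaded(unsigned int sum_nr_running,
                                          unsigned int group_weight,
                                          unsigned long group_usage,
                                          unsigned long group_capacity)
    {
        if (sum_nr_running <= group_weight)
            return 0;
        return group_usage * 100 > group_capacity * 95;   /* placeholder margin */
    }

    /* room for more tasks: spare CPUs or spare capacity left for CFS */
    static int sketch_group_has_capacity(unsigned int sum_nr_running,
                                         unsigned int group_weight,
                                         unsigned long group_usage,
                                         unsigned long group_capacity)
    {
        if (sum_nr_running < group_weight)
            return 1;
        return group_usage * 100 < group_capacity * 95;   /* placeholder margin */
    }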
2014-11-03sched: get CPU's usage statisticVincent Guittot
Monitor the usage level of each group of each sched_domain level. The usage is the portion of cpu_capacity_orig that is currently used on a CPU or group of CPUs. We use utilization_load_avg to evaluate the usage level of each group.

The utilization_load_avg only takes into account the running time of the CFS tasks on a CPU, with a maximum value of SCHED_LOAD_SCALE when the CPU is fully utilized. Nevertheless, we must cap utilization_load_avg, which can be temporarily greater than SCHED_LOAD_SCALE after the migration of a task to this CPU and until the metrics are stabilized.

The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the running load on the CPU, whereas the available capacity for CFS tasks is in the range [0..cpu_capacity_orig]. In order to test whether a CPU is fully utilized by CFS tasks, we have to scale the utilization into the cpu_capacity_orig range of the CPU to get the usage of the latter. The usage can then be compared with the available capacity (i.e. cpu_capacity) to deduce the usage level of a CPU.

The frequency scaling invariance of the usage is not taken into account in this patch; it will be solved in another patch which will deal with frequency scaling invariance of the running_load_avg.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
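An illustrative transcription of that scaling (a sketch, not the exact kernel code):

    #define SCHED_LOAD_SCALE 1024UL

    /* map utilization_load_avg ([0..SCHED_LOAD_SCALE]) into the
     * [0..cpu_capacity_orig] range, capping the transient overshoot that
     * can follow a task migration */
    static unsigned long sketch_get_cpu_usage(unsigned long utilization_load_avg,
                                              unsigned long capacity_orig)
    {
        if (utilization_load_avg >= SCHED_LOAD_SCALE)
            return capacity_orig;

        return (utilization_load_avg * capacity_orig) / SCHED_LOAD_SCALE;
    }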
2014-11-03sched: add per rq cpu_capacity_origVincent Guittot
This new field, cpu_capacity_orig, reflects the original capacity of a CPU before being altered by rt tasks and/or IRQ.

The cpu_capacity_orig will be used:
 - to detect when the capacity of a CPU has been noticeably reduced so we can trigger a load balance to look for a CPU with better capacity. As an example, we can detect when a CPU handles a significant amount of irq (with CONFIG_IRQ_TIME_ACCOUNTING) but is seen as an idle CPU by the scheduler, whereas CPUs which are really idle are available.
 - to evaluate the available capacity for CFS tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
2014-11-03sched: make scale_rt invariant with frequencyVincent Guittot
The average running time of RT tasks is used to estimate the remaining compute capacity for CFS tasks. This remaining capacity is the original capacity scaled down by a factor (aka scale_rt_capacity). This estimation of available capacity must also be invariant with frequency scaling.

A frequency scaling factor is applied on the running time of the RT tasks for computing scale_rt_capacity. In sched_rt_avg_update, we scale the RT execution time like below:

    rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT

Then, scale_rt_capacity can be summarized by:

    scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)

We can optimize by removing the right and left shifts in the computation of rq->rt_avg and scale_rt_capacity.

The call to arch_scale_freq_capacity in the rt scheduling path might be a concern for RT folks because I'm not sure whether we can rely on arch_scale_freq_capacity to be short and efficient?

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
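The two formulas above, transcribed into a small standalone sketch (the stubbed arch_scale_freq_capacity() value and the final floor are assumptions):

    #define SCHED_CAPACITY_SHIFT 10
    #define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

    /* stand-in: pretend the cpu currently runs at half of its max frequency */
    static unsigned long stub_arch_scale_freq_capacity(void)
    {
        return SCHED_CAPACITY_SCALE / 2;
    }

    /* frequency-invariant accumulation of RT running time */
    static void sketch_rt_avg_update(unsigned long *rt_avg, unsigned long rt_delta)
    {
        *rt_avg += (rt_delta * stub_arch_scale_freq_capacity()) >> SCHED_CAPACITY_SHIFT;
    }

    /* capacity left for CFS once the (scaled) RT share is removed */
    static unsigned long sketch_scale_rt_capacity(unsigned long rt_avg,
                                                  unsigned long period)
    {
        unsigned long used = (rt_avg << SCHED_CAPACITY_SHIFT) / period;

        if (used >= SCHED_CAPACITY_SCALE)
            return 1;   /* arbitrary floor, just for the sketch */
        return SCHED_CAPACITY_SCALE - used;
    }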
2014-11-03sched: Make sched entity usage tracking scale-invariantMorten Rasmussen
Apply a frequency scale-invariance correction factor to usage tracking. Each segment of the running_load_avg geometric series is now scaled by the current frequency, so the utilization_avg_contrib of each entity will be invariant with frequency scaling. As a result, utilization_load_avg, which is the sum of utilization_avg_contrib, becomes invariant too. So the usage level that is returned by get_cpu_usage stays relative to the max frequency, just like the cpu_capacity it is compared against.

Then, we want to keep the load tracking values in a 32-bit type, which implies that the max value of {runnable|running}_avg_sum must be lower than 2^32/88761 = 48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742, arch_scale_freq_capacity must return a value less than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY = 1024). So we define the range to [0..SCHED_SCALE_CAPACITY] in order to avoid overflow.

cc: Paul Turner <pjt@google.com> cc: Ben Segall <bsegall@google.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
2014-11-03sched: remove frequency scaling from cpu_capacityVincent Guittot
Now that arch_scale_cpu_capacity has been introduced to scale the original capacity, the arch_scale_freq_capacity is no longer used (it was previously used by ARM arch). Remove arch_scale_freq_capacity from the computation of cpu_capacity. The frequency invariance will be handled in the load tracking and not in the CPU capacity. arch_scale_freq_capacity will be revisited for scaling load with the current frequency of the CPUs in a later patch. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>