|
Conflicts:
include/linux/sched.h
kernel/sched/Makefile
|
|
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
On X86 there are already tracepoints for IRQ vectors through which IPIs
are handled. However this is highly X86 specific, and the IPI signaling
is not currently traced.
This is an attempt at adding generic IPI tracepoints to X86.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
Add ftrace events for debugging purposes
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
The io latency tracking mechanism was recently added to the energy aware
scheduler kernel tree. The purpose of this framework is to provide a way to
predict IO latencies, in other words to estimate how long we will be
sleeping while waiting for an IO. When the cpu goes idle, we know the sleep
duration from the timer, but for the other wakeups we rely on some
statistics in the menu governor, which is part of the cpuidle framework.
The io latency tracking will provide additional information about the
length of the expected sleep time, which, combined with the timer duration,
should give us a more accurate prediction.
The first step of the io latency tracking simply used a sliding average of
the values, which is not really accurate as it is not immune to IO
ping-pong or big variations.
In order to improve that, each latency is grouped into a bucket which
represents an interval of latencies, and a sliding average is computed per
bucket.
Why? Because we don't want to take all the latencies and compute the
statistics on them: it does not make sense, takes a lot of memory and
computation time, and yields a result which is mathematically impossible
to resolve. It is better to use intervals to group the small variations of
the latencies. For example, 186us, 123us and 134us can fall into the bucket
[100 - 199].
The size of the bucket is the bucket interval and represents the resolution
of the statistical model. E.g. a bucket interval of 1us leads us to do
statistics on every individual value, with of course a bad prediction
because the number of latencies is big. A bigger interval can give better
statistics, but can give us a misprediction as the interval is larger.
Choosing the size of the bucket interval vs the idle sleep time is the
tradeoff to find. With a 200us bucket interval, the measurements show we
still have good predictions, fewer mispredictions, and cover the idle
state target residency.
The buckets are dynamically created and stored in a list. A new bucket is
added at the end of the list.
This list is reordered depending on the number of successive hits a bucket
gets. The more a bucket is successively hit, the closer it moves to the
head of the list.
The guessed next latency, which is a bucket (meaning the latency will be
between e.g. 200us and 300us, with a bucket interval of 100us), is
retrieved from the list. Each bucket present in the list gets a score: the
more hits a bucket has, the bigger its score. *But* this is weighted by the
position in the list: the first elements have more weight than the last
ones. This position is dynamically changed when a bucket is hit several
times.
Example with the following latencies:
10, 100, 100, 100, 100, 100, 10, 10
We will have two buckets: 0 and 1.
10 => bucket0(1)
100 => bucket0(1), bucket1(1)
100 => bucket0(1), bucket1(2)
100 => bucket0(1), bucket1(3)
100 => bucket0(1), bucket1(4)
* 100 => bucket1(5), bucket0(1)
10 => bucket1(5), bucket0(2)
10 => bucket1(5), bucket0(3)
At (*), bucket1 reached 5 successive hits and has been moved to the
beginning of the list, and bucket0 became the second one.
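The promotion logic above can be sketched in plain C (a userspace toy
model: the bucket structure, the field names and the 5-hit promotion
threshold are illustrative assumptions, not the actual implementation):

```c
#include <assert.h>
#include <stdlib.h>

#define BUCKET_INTERVAL 100   /* us, resolution of the model */
#define PROMOTE_HITS      5   /* successive hits before moving to head */

struct bucket {
	int id;               /* interval [id*INTERVAL, (id+1)*INTERVAL) */
	int hits;             /* total hits */
	int streak;           /* current successive hits */
	struct bucket *next;
};

static struct bucket *head;

static struct bucket *bucket_find(int id)
{
	struct bucket *b;

	for (b = head; b; b = b->next)
		if (b->id == id)
			return b;
	return NULL;
}

/* account one observed IO latency (in us) */
static void bucket_hit(int latency)
{
	int id = latency / BUCKET_INTERVAL;
	struct bucket *b = bucket_find(id), *p;

	if (!b) {
		/* dynamically create the bucket, append at the tail */
		b = calloc(1, sizeof(*b));
		b->id = id;
		if (!head) {
			head = b;
		} else {
			for (p = head; p->next; p = p->next)
				;
			p->next = b;
		}
	}

	/* a hit on this bucket breaks every other bucket's streak */
	for (p = head; p; p = p->next)
		if (p != b)
			p->streak = 0;

	b->hits++;
	b->streak++;

	/* enough successive hits: move the bucket to the head */
	if (b->streak >= PROMOTE_HITS && b != head) {
		for (p = head; p->next != b; p = p->next)
			;
		p->next = b->next;
		b->next = head;
		head = b;
	}
}
```

Feeding the sequence 10, 100, 100, 100, 100, 100, 10, 10 ends with bucket1
at the head with 5 hits and bucket0 second with 3 hits, as in the trace
above.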
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
Evaluating the gap between the prediction of the next event waking up a cpu
and the effective sleep duration is not possible today.
This simple patch provides a set of files to evaluate whether the
predictions are correct or not.
They are split into 3 categories:
* under estimated : the sleep duration was greater than what we were
expecting, so we could have gone into a deeper idle state. That is true if
the deeper idle state was not disabled and fit the exit latency
requirement.
* over estimated : the sleep duration was smaller than what we were
expecting and we did not reach the break-even point for the selected idle
state.
* well estimated : the expected sleep duration led to the correct idle
state selection and we got an optimal energy saving.
This information is exported via debugfs in the sched/idle directory.
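A minimal sketch of this three-way classification (the function name and
the residency-based thresholds are assumptions for illustration; the patch
itself only exports counters):

```c
#include <assert.h>

enum estimation { UNDER_ESTIMATED, OVER_ESTIMATED, WELL_ESTIMATED };

/* Classify one idle episode. chosen_us is the target residency of the
 * selected idle state, deeper_us the target residency of the next
 * deeper state that was available. */
static enum estimation classify(long slept_us, long chosen_us, long deeper_us)
{
	if (slept_us >= deeper_us)
		return UNDER_ESTIMATED;  /* we could have gone deeper */
	if (slept_us < chosen_us)
		return OVER_ESTIMATED;   /* break-even point not reached */
	return WELL_ESTIMATED;           /* correct state selection */
}
```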
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
This simple governor takes into account the predictable events: the timer
sleep duration and the next expected IO sleep duration. By mixing both it
deduces which idle state fits best. This governor must be extended with a
statistical approach to predict all the other events.
The main purpose of this governor is to handle the guessed next events in a
categorized way:
1. deterministic events : timers
2. guessed events : IOs
3. predictable events : keystroke, incoming network packet, ...
This governor is meant to be moved later near the scheduler, so it can
inspect/inject more information and act proactively rather than
reactively.
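The core of such a governor can be sketched as follows (a userspace toy:
the state table, the residency values and the selection helper are
hypothetical, and real code would also have to honor latency constraints):

```c
#include <assert.h>

struct idle_state {
	const char *name;
	long target_residency_us;
};

/* hypothetical platform table, ordered from shallowest to deepest */
static const struct idle_state states[] = {
	{ "WFI",       1    },
	{ "retention", 200  },
	{ "power-off", 2000 },
};

/* mix the two predictable durations and pick the deepest state whose
 * target residency fits the most imminent expected wakeup */
static int select_state(long timer_us, long next_io_us)
{
	long next_event = timer_us < next_io_us ? timer_us : next_io_us;
	int i, best = 0;

	for (i = 0; i < (int)(sizeof(states) / sizeof(states[0])); i++)
		if (states[i].target_residency_us <= next_event)
			best = i;
	return best;
}
```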
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Conflicts:
drivers/cpuidle/Kconfig
|
|
As we want to improve the sleep duration estimation, the expected IO
latency duration is passed to the cpuidle framework. The governors will
have to deal with it if they are interested in this information.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Conflicts:
drivers/cpuidle/governors/menu.c
|
|
Moving around the different functions dealing with the time made the time
headers no longer necessary in cpuidle.c.
Remove them.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
Following the logic of the previous patch, retrieve from the idle task the
expected timer sleep duration and pass it to the cpuidle framework.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
As we get the latency_req from cpuidle_idle_call, just pass it to the
cpuidle layer instead of duplicating the code across the governors.
That has the benefit of gradually moving the different timings we want to
integrate with the scheduler closer to it.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Conflicts:
drivers/cpuidle/governors/menu.c
|
|
If zero latency is required, we don't want to invoke any cpuidle code at
all.
Move the check into the governors and do it before selecting the state, in
order to fall back to the default idle function.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
In order to have a good prediction of when the next event will occur, the
cpuidle menu governor keeps statistics about the occurrences of the events
waking up a cpu. For more details, refer to the header of menu.c located
in drivers/cpuidle/governors.
Part of the prediction takes into account the number of pending IOs on the
cpu, and depending on this number it uses a 'magic' number to force the
selection of shallow states. That makes sense and provided a good
improvement in terms of system latencies for servers. Unfortunately this
approach has some drawbacks. The first one is the empirical approach: the
magic 'performance multiplier' is based on measurements for a specific
hardware and architecture, and may not fit well for different
architectures or for new hardware which evolves over time. The second one
is the mixing of all the wakeup sources, which makes it impossible to
track when a task is migrated across the cpus. And the last one is the
lack of proper tracking of what is happening on the system.
In order to improve that, we can classify three kinds of events:
1. totally predictable events : this is the case for the timers
2. partially predictable events : for example, hard disk accesses with
sleep times which fall more or less within a reasonable interval; the
same for SSD or SD-card
3. difficult to predict events : incoming network packets, keyboard or
mouse events. These need a statistical approach.
At the moment, 1., 2. and 3. are all mixed in the governor's
statistics.
This patchset provides a simplified version of an io latency tracking
mechanism in order to separate and improve category 2, that is the
partially predictable events. As the scheduler is a good place to measure
how long a task is blocked on an IO, all the code of this patchset is tied
to it.
The sched entity and the io latency tracking share the same design:
there is an rb tree per cpu. Each time a task is blocked on an IO, it is
inserted into the tree. When the IO completes and the task is woken up,
its average latency is updated with the time spent waiting for the IO and
it is removed from the tree. The next time, it will be inserted into the
tree again on io_schedule.
If several tasks are blocked on an IO, the leftmost node of the tree is
the minimal latency; in other words it gives the next IO event. This
information may need to be balanced against the number of pending IOs (the
more the disk is accessed, the slower it is).
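The leftmost-node lookup can be illustrated with a tiny stand-in for the
per-cpu rb tree (a sorted list here, purely for illustration; the kernel
code would use rb_root/rb_first, and the struct and field names below are
invented):

```c
#include <assert.h>
#include <stddef.h>

/* one blocked task; expected_end_us is its predicted IO completion time */
struct io_wait {
	long expected_end_us;
	struct io_wait *next;
};

/* stand-in for the per-cpu rb tree: kept sorted on insertion */
struct io_tree {
	struct io_wait *first;
};

/* a task blocks on an IO: insert it in sorted order */
static void io_block(struct io_tree *t, struct io_wait *w)
{
	struct io_wait **p = &t->first;

	while (*p && (*p)->expected_end_us < w->expected_end_us)
		p = &(*p)->next;
	w->next = *p;
	*p = w;
}

/* the "leftmost node": the earliest expected IO completion, i.e. the
 * next IO event on this cpu, or -1 if nothing is blocked */
static long io_next_event(struct io_tree *t)
{
	return t->first ? t->first->expected_end_us : -1;
}
```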
By splitting out the three categories we can guess the next event on the
system more accurately, in conjunction with the next timer and some
simplified statistics from the menu governor. Furthermore, that has the
benefit of taking task migration into account, as the information is
embedded with the task.
This is a simplified version because the tracking could be greatly
improved by a polynomial regression like the sched entity tracking, and
the latencies should also be tracked per device, but that implies a bigger
impact as the different callers of io_schedule would have to provide a
dev_t parameter. A too complex patch won't help the understanding.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
This patch adds a sysctl energy aware scheduling option to choose between:
* the energy aware scheduler
* the old scheduler code path
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
Building on top of the scale invariant capacity patches and earlier
patches in this series that prepare CFS for scaling cpu frequency, this
patch implements a simple, naive ondemand-like cpu frequency scaling
policy that is driven by enqueue_task_fair and dequeue_task_fair. This
new policy is named "energy_model" as an homage to the on-going work in
that area. It is NOT an actual energy model.
This policy is implemented using the CPUfreq governor interface for two
main reasons:
1) re-using the CPUfreq machine drivers without using the governor
interface is hard. I do not foresee any issue continuing to use the
governor interface going forward but it is worth making clear what this
patch does up front.
2) using the CPUfreq interface allows us to switch between the
energy_model governor and other CPUfreq governors (such as ondemand) at
run-time. This is very useful for comparative testing and tuning.
A caveat to #2 above is that the weak arch function used by the governor
means that only one scheduler-driven policy can be linked at a time.
This limitation does not apply to "traditional" governors. I raised this
in my previous capacity_ops patches[0] but as discussed at LPC14 last
week, it seems desirable to pursue a single cpu frequency scaling policy
at first, and try to make that work for everyone interested in using it.
If that model breaks down then we can revisit the idea of dynamic
selection of scheduler-driven cpu frequency scaling.
Unlike legacy CPUfreq governors, this policy does not implement its own
logic loop (such as a workqueue triggered by a timer), but instead uses
an event-driven design. Frequency is evaluated by entering
{en,de}queue_task_fair and then a kthread is woken from
run_rebalance_domains which scales cpu frequency based on the latest
evaluation.
The policy implemented in this patch takes the highest cpu utilization
from policy->cpus and uses that to select a frequency target based on the
same 80%/20% thresholds used as defaults in ondemand. Frequency-scaled
thresholds are pre-computed when energy_model inits. The frequency
selection is a simple comparison of cpu utilization (as defined in
Morten's latest RFC) to the threshold values. In the future this logic
could be replaced with something more sophisticated that uses PELT to
get a historical overview. Ideas are welcome.
Note that the pre-computed thresholds above do not take into account
micro-architecture differences (SMT or big.LITTLE hardware), only
frequency invariance.
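A sketch of the threshold comparison described above (a userspace toy; the
capacity table, the helper name and the step-by-one policy are assumptions,
as the text only specifies the 80%/20% thresholds):

```c
#include <assert.h>

#define UP_PCT   80   /* go faster above 80% of current capacity */
#define DOWN_PCT 20   /* go slower below 20% of current capacity */

/* cap[] holds a pre-computed, frequency-scaled capacity for each
 * P-state (lowest to highest); util is the highest cpu utilization
 * among policy->cpus, in the same capacity units */
static int eval_freq(const long *cap, int nstates, int cur, long util)
{
	if (cur < nstates - 1 && util * 100 > cap[cur] * UP_PCT)
		return cur + 1;   /* above the up threshold: speed up */
	if (cur > 0 && util * 100 < cap[cur] * DOWN_PCT)
		return cur - 1;   /* below the down threshold: slow down */
	return cur;               /* stay put */
}
```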
Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
|
|
{en,de}queue_task_fair are updated to track which cpus will have changed
utilization values as function of task queueing. The affected cpus are
passed on to arch_eval_cpu_freq for further machine-specific processing
based on a selectable policy.
arch_scale_cpu_freq is called from run_rebalance_domains as a way to
kick off the scaling process (via wake_up_process), so as to prevent
re-entering the {en,de}queue code.
All of the call sites in this patch are up for discussion. Does it make
sense to track which cpus have updated statistics in enqueue_fair_task?
I chose this because I wanted to gather statistics for all cpus affected
in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
next version of this patch will focus on the simpler case of not using
scheduler cgroups, which should remove a good chunk of this code,
including the cpumask stuff.
Also discussed at LPC14 is the fact that load_balance is a very
interesting place to do this, as frequency can be considered in concert
with task placement. Please put forth any ideas on a sensible way to do
this.
Is run_rebalance_domains a logical place to change cpu frequency? What
other call sites make sense?
Even for platforms that can target a cpu frequency without sleeping
(x86, some ARM platforms with PM microcontrollers) it is currently
necessary to always kick the frequency target work out into a kthread.
This is because of the rw_sem usage in the cpufreq core which might
sleep. Replacing that lock type is probably a good idea.
Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
|
|
arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow the
scheduler to evaluate if cpu frequency should change and to invoke that
change from a safe context.
They are weakly defined arch functions that do nothing by default. A
CPUfreq governor could use these functions to implement a frequency
scaling policy based on updates to per-task statistics or updates to
per-cpu utilization.
As discussed at Linux Plumbers Conference 2014, the goal will be to
focus on a single cpu frequency scaling policy that works for everyone.
That may mean that the weak arch functions definitions can be removed
entirely and a single policy implements that logic for all
architectures.
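The weak default hooks might look like this (a userspace illustration of
the weak-symbol mechanism; the actual kernel prototypes are not reproduced
here, and the int return only exists to make the no-op default observable):

```c
#include <assert.h>

/* Do-nothing defaults; a scheduler-driven cpufreq governor linked into
 * the image would provide strong definitions overriding these. */
__attribute__((weak)) int arch_eval_cpu_freq(unsigned int cpu)
{
	(void)cpu;
	return 0;   /* no evaluation performed by default */
}

__attribute__((weak)) int arch_scale_cpu_freq(void)
{
	return 0;   /* no frequency change requested by default */
}
```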
Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
|
|
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Mike Turquette <mturquette@linaro.org>
|
|
capacity_of() and get_cpu_usage() are useful for cpu frequency scaling
policies. Share them via sched.h so that selectable cpu frequency scaling
policies can make use of them.
Signed-off-by: Mike Turquette <mturquette@linaro.org>
|
|
This patch introduces the ENERGY_AWARE sched feature, which is
implemented using jump labels when SCHED_DEBUG is defined. It is
statically set false when SCHED_DEBUG is not defined. Hence this doesn't
allow energy awareness to be enabled without SCHED_DEBUG. This
sched_feature knob will be replaced later with a more appropriate
control knob when things have matured a bit.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Mike Turquette <mturquette@linaro.org>
[mturquette@linaro.org: moved energy_aware above enqueue_task_fair]
|
|
git://git.linaro.org/people/vincent.guittot/kernel into eas-next
|
|
As discussed [1], accounting IO is meant for blkio only. Document that
so driver authors won't use them for device io.
[1] http://thread.gmane.org/gmane.linux.drivers.i2c/20470
Cc: Ingo Molnar <mingo@redhat.com>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1415098901-2768-1-git-send-email-wsa@the-dreams.de
|
|
Remove question mark:
s/New utsname group?/New utsname namespace
Unified style for IPC:
s/New ipcs/New ipc namespace
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1415091082-15093-1-git-send-email-chenhanxiao@cn.fujitsu.com
|
|
This patch simplifies task_struct by removing the four numa_* pointers
in the same array and replacing them with the array pointer. By doing this,
on x86_64, the size of task_struct is reduced by 3 ulong pointers (24 bytes on
x86_64).
A new parameter is added to the task_faults_idx function so that it can return
an index to the correct offset, corresponding with the old precalculated
pointers.
All of the code in sched/ that depended on task_faults_idx and numa_* was
changed in order to match the new logic.
Cc: mgorman@suse.de
Cc: dave@stgolabs.net
Cc: riel@redhat.com
Cc: mingo@redhat.com
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
|
|
There are both UP and SMP versions of pull_dl_task(), so there is no need
to check CONFIG_SMP in switched_from_dl().
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1414708776-124078-6-git-send-email-wanpeng.li@linux.intel.com
|
|
In switched_from_dl() we have to issue a resched if we successfully
pulled some task from other cpus. This patch also aligns the behavior
with -rt.
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>
Suggested-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1414708776-124078-5-git-send-email-wanpeng.li@linux.intel.com
|
|
This patch pushes the task away if its deadline is equal to current's
during wakeup, the same behavior as the rt class.
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1414708776-124078-4-git-send-email-wanpeng.li@linux.intel.com
|
|
This patch adds printing of the deadline rq status.
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng.li@linux.intel.com
|
|
The yield semantic of the deadline class is to reduce the remaining
runtime to zero, after which update_curr_dl() will stop the task. However,
the consumed bandwidth is subtracted from the budget of the yielding task
again even though it has already been set to zero, which leads to an
artificial overrun. This patch fixes it by making sure we don't steal any
more time from the task that yielded in update_curr_dl().
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>
Suggested-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1414708776-124078-2-git-send-email-wanpeng.li@linux.intel.com
|
|
This patch checks in advance whether current can be pushed/pulled
somewhere else, to make the logic clear; the same behavior as the dl
class.
- If current can't be migrated, it is useless to reschedule; let's hope
the other task can move out.
- If the task is migratable, let's not reschedule it and see if it can be
pushed or pulled somewhere else.
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1414708776-124078-1-git-send-email-wanpeng.li@linux.intel.com
|
|
As per commit f10e00f4bf36 ("sched/dl: Use dl_bw_of() under
rcu_read_lock_sched()"), dl_bw_of() has to be protected by
rcu_read_lock_sched().
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1414497286-28824-1-git-send-email-juri.lelli@arm.com
|
|
An idle cpu is idler than a non-idle cpu, so we needn't search for the
least loaded cpu once we have found an idle cpu.
Cc: <peterz@infradead.org>
Cc: <mingo@redhat.com>
Signed-off-by: Yao Dongdong <yaodongdong@huawei.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1414469286-6023-1-git-send-email-yaodongdong@huawei.com
|
|
Currently used hrtimer_try_to_cancel() is racy:
raw_spin_lock(&rq->lock)
... dl_task_timer raw_spin_lock(&rq->lock)
... raw_spin_lock(&rq->lock) ...
switched_from_dl() ... ...
hrtimer_try_to_cancel() ... ...
switched_to_fair() ... ...
... ... ...
... ... ...
raw_spin_unlock(&rq->lock) ... (acquired)
... ... ...
... ... ...
do_exit() ... ...
schedule() ... ...
raw_spin_lock(&rq->lock) ... raw_spin_unlock(&rq->lock)
... ... ...
raw_spin_unlock(&rq->lock) ... raw_spin_lock(&rq->lock)
... ... (acquired)
put_task_struct() ... ...
free_task_struct() ... ...
... ... raw_spin_unlock(&rq->lock)
... (acquired) ...
... ... ...
... (use after free) ...
So, let's implement a 100% guaranteed way to cancel the timer and be sure
we are safe even in very unlikely situations.
rq unlocking does not limit the area where switched_from_dl() can be used,
because this has already been possible in pull_dl_task() below.
Let's consider the safety of this unlocking. The new code in the patch
runs when hrtimer_try_to_cancel() fails. This means the callback is
running. In this case hrtimer_cancel() is just waiting till the callback
is finished. Two cases are possible:
1) Since we are in switched_from_dl(), the new class is not dl_sched_class
and the new prio is not less than MAX_DL_PRIO. So, the callback returns
early; it's right after the !dl_task() check. After that hrtimer_cancel()
returns back too.
The above is:
raw_spin_lock(rq->lock); ...
... dl_task_timer()
... raw_spin_lock(rq->lock);
switched_from_dl() ...
hrtimer_try_to_cancel() ...
raw_spin_unlock(rq->lock); ...
hrtimer_cancel() ...
... raw_spin_unlock(rq->lock);
... return HRTIMER_NORESTART;
... ...
raw_spin_lock(rq->lock); ...
2) But the below is also possible:
dl_task_timer()
raw_spin_lock(rq->lock);
...
raw_spin_unlock(rq->lock);
raw_spin_lock(rq->lock); ...
switched_from_dl() ...
hrtimer_try_to_cancel() ...
... return HRTIMER_NORESTART;
raw_spin_unlock(rq->lock); ...
hrtimer_cancel(); ...
raw_spin_lock(rq->lock); ...
In this case hrtimer_cancel() returns immediately. A very unlikely case,
just to mention it.
Nobody can manipulate the task, because check_class_changed() is always
called with pi_lock held. Nobody can force the task to participate in
(concurrent) priority inheritance schemes (for the same reason). All
concurrent task operations require pi_lock, which is held by us. No
deadlocks with dl_task_timer() are possible, because it returns right
after the !dl_task() check (it does nothing).
If the task becomes a dl_task again while the rq is unlocked, we just
don't have to do pull_dl_task() in switched_from_dl() any further.
[peterz: Added comments]
Acked-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai
|
|
In some cases this can trigger a true flood of output.
Requested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
rtnl_lock_unregistering*() take rtnl_lock() -- a mutex -- inside a wait
loop. The wait loop relies on current->state to function, but so does
mutex_lock(); nesting them causes the inner to destroy the outer's state.
Fix this using the new wait_woken() bits.
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Acked-by: David S. Miller <davem@davemloft.net>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20141029173110.GE15602@worktop.programming.kicks-ass.net
|
|
rfcomm_run() is a tad broken in that it has a nested wait loop. One cannot
rely on p->state for the outer wait because the inner wait will overwrite
it.
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
The kauditd_thread wait loop is a bit iffy; it has a number of problems:
- calls try_to_freeze() before schedule(); you typically want the
thread to re-evaluate the sleep condition when unfreezing, also
freeze_task() issues a wakeup.
- it unconditionally does the {add,remove}_wait_queue(), even when the
sleep condition is false.
Use wait_event_freezable() that does the right thing.
Cc: mingo@kernel.org
Cc: oleg@redhat.com
Cc: torvalds@linux-foundation.org
Cc: tglx@linutronix.de
Cc: Eric Paris <eparis@redhat.com>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20141002102251.GA6324@worktop.programming.kicks-ass.net
|
|
There are no users... make it go away.
Cc: oleg@redhat.com
Cc: Rafael Wysocki <rjw@rjwysocki.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Provide better implementations of the wait_event_freezable() APIs.
The problem is with freezer_do_not_count(): it hides the thread from the
freezer, even though this thread might not actually freeze/sleep at all.
Cc: oleg@redhat.com
Cc: Rafael Wysocki <rjw@rjwysocki.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-d86fz1jmso9wjxa8jfpinp8o@git.kernel.org
|
|
There is a race between kthread_stop() and the new wait_woken() that
can result in a lack of progress.
CPU 0 | CPU 1
|
rfcomm_run() | kthread_stop()
... |
if (!test_bit(KTHREAD_SHOULD_STOP)) |
| set_bit(KTHREAD_SHOULD_STOP)
| wake_up_process()
wait_woken() | wait_for_completion()
set_current_state(INTERRUPTIBLE) |
if (!WQ_FLAG_WOKEN) |
schedule_timeout() |
|
After which both tasks will wait... forever.
Fix this by having wait_woken() check for kthread_should_stop() but
only for kthreads (obviously).
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
capacity for CFS tasks can be significantly reduced. Once we detect such a
situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an
idle load balance to check whether it's worth moving its tasks to an idle
CPU.
Once the idle load balance has selected the busiest CPU, it will look for
an active load balance in only two cases:
- there is only 1 task on the busiest CPU.
- we haven't been able to move a task from the busiest rq.
A CPU with reduced capacity is covered by the 1st case, and it's worth
actively migrating its task if the idle CPU has full capacity. This test
has been added in need_active_balance.
As a side note, this will not generate more spurious ilbs because we
already trigger an ilb if there is more than 1 busy cpu. If this cpu is
the only one that has a task, we will trigger the ilb once to migrate the
task.
The nohz_kick_needed function has been cleaned up a bit while adding the
new test.
env.src_cpu and env.src_rq must be set unconditionally because they are
used in need_active_balance, which is called even if busiest->nr_running
equals 1.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
the scheduler will put at least 1 task per core.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
|
|
The scheduler tries to compute how many tasks a group of CPUs can handle
by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE. group_capacity_factor divides the capacity of the
group by SCHED_LOAD_SCALE to estimate how many tasks can run in the group.
Then it compares this value with the sum of nr_running to decide whether
the group is overloaded. But group_capacity_factor hardly works for SMT
systems: it sometimes works for big cores but fails to do the right thing
for little cores.
Below are two examples to illustrate the problem that this patch solves:
1 - If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUs will have a max capacity_factor of
2 (div_round_closest(3x640/1024) = 2), which means that it will be seen as
overloaded even if we have only one task per CPU.
2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
(1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
(at max, and thanks to the fix [0] for SMT systems that prevents the
appearance of ghost CPUs), but if one CPU is fully used by rt tasks (and
its capacity is reduced to nearly nothing), the capacity factor of the
group will still be 4 (div_round_closest(3*1512/1024) = 5, which is capped
to 4 with [0]).
So, this patch tries to solve this issue by removing capacity_factor and
replacing it with the 2 following metrics:
- The CPU's available capacity for CFS tasks, which is already used by
load_balance.
- The usage of the CPU by the CFS tasks. For the latter,
utilization_avg_contrib has been re-introduced to compute the usage of a
CPU by CFS tasks.
group_capacity_factor and group_has_free_capacity have been removed and
replaced by group_no_capacity. We compare the number of tasks with the
number of CPUs and we evaluate the level of utilization of the CPUs to
decide whether a group is overloaded or has capacity to handle more tasks.
For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1
task, so it will be selected in priority (among the overloaded groups).
Since [1], SD_PREFER_SIBLING is no longer concerned by the computation of
load_above_capacity because local is not overloaded.
Finally, sched_group->sched_group_capacity->capacity_orig has been removed
because it is no longer used during load balance.
[1] https://lkml.org/lkml/2014/8/12/295
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
Monitor the usage level of each group of each sched_domain level. The usage is
the portion of cpu_capacity_orig that is currently used on a CPU or group of
CPUs. We use the utilization_load_avg to evaluate the usage level of each
group.
The utilization_load_avg only takes into account the running time of the CFS
tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
utilized. Nevertheless, we must cap utilization_load_avg, which can be
temporarily greater than SCHED_LOAD_SCALE after the migration of a task onto
this CPU and until the metrics stabilize.
The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
running load on the CPU whereas the available capacity for CFS tasks is in
the range [0..cpu_capacity_orig]. In order to test whether a CPU is fully
utilized by CFS tasks, we have to scale the utilization into the
cpu_capacity_orig range of the CPU to get its usage. The usage can then be
compared with the available capacity (ie cpu_capacity) to deduce the usage
level of a CPU.
The frequency scaling invariance of the usage is not taken into account in this
patch, it will be solved in another patch which will deal with frequency
scaling invariance on the running_load_avg.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
This new field cpu_capacity_orig reflects the original capacity of a CPU
before being altered by rt tasks and/or IRQs.
The cpu_capacity_orig will be used:
- to detect when the capacity of a CPU has been noticeably reduced so we can
trigger load balancing to look for a CPU with better capacity. As an example,
we can detect when a CPU handles a significant amount of irq
(with CONFIG_IRQ_TIME_ACCOUNTING) but is seen as an idle CPU by the
scheduler whereas CPUs that are really idle are available;
- to evaluate the available capacity for CFS tasks.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
|
|
The average running time of RT tasks is used to estimate the remaining compute
capacity for CFS tasks. This remaining capacity is the original capacity scaled
down by a factor (aka scale_rt_capacity). This estimation of available capacity
must also be invariant with frequency scaling.
A frequency scaling factor is applied on the running time of the RT tasks for
computing scale_rt_capacity.
In sched_rt_avg_update, we scale the RT execution time like below:
rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
Then, scale_rt_capacity can be summarized by:
scale_rt_capacity = SCHED_CAPACITY_SCALE -
((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
We can optimize by removing the right and left shifts in the computation of
rq->rt_avg and scale_rt_capacity.
The call to arch_scale_freq_capacity in the rt scheduling path might be a
concern for RT folks because I'm not sure whether we can rely on
arch_scale_freq_capacity to be short and efficient.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
Apply frequency scale-invariance correction factor to usage tracking.
Each segment of the running_load_avg geometric series is now scaled by the
current frequency so the utilization_avg_contrib of each entity will be
invariant with frequency scaling. As a result, utilization_load_avg which is
the sum of utilization_avg_contrib, becomes invariant too. So the usage level
returned by get_cpu_usage stays relative to the max frequency, like the
cpu_capacity it is compared against.
Then, we want to keep the load tracking values in a 32-bit type, which implies
that the max value of {runnable|running}_avg_sum must be lower than
2^32/88761 = 48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
arch_scale_freq_capacity must return a value less than
(48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024).
So we define the range to [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
Now that arch_scale_cpu_capacity has been introduced to scale the original
capacity, arch_scale_freq_capacity is no longer used (it was
previously used by the ARM arch). Remove arch_scale_freq_capacity from the
computation of cpu_capacity. The frequency invariance will be handled in the
load tracking and not in the CPU capacity. arch_scale_freq_capacity will be
revisited for scaling load with the current frequency of the CPUs in a later
patch.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
|