2015-07-27cpufreq: introduce cpufreq_driver_might_sleepMichael Turquette
Some architectures and platforms perform CPU frequency transitions through a non-blocking method, while some might block or sleep. This distinction is important when trying to change frequency from interrupt context or in any other non-interruptible context, such as from the Linux scheduler. Describe this distinction with a cpufreq driver flag, CPUFREQ_DRIVER_WILL_NOT_SLEEP. The default is to not have this flag set, thus erring on the side of caution. cpufreq_driver_might_sleep() is also introduced in this patch. Setting the above flag will allow this function to return false. Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Michael Turquette <mturquette@baylibre.com>
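The flag semantics can be sketched in userspace C (hedged: the real kernel helper takes no argument and consults the registered driver; the struct layout and bit position here are illustrative only):

```c
#include <stdbool.h>

/* Illustrative bit position; the real flag value would sit alongside the
 * other driver flags in include/linux/cpufreq.h. */
#define CPUFREQ_DRIVER_WILL_NOT_SLEEP (1U << 4)

struct cpufreq_driver_sketch {
    unsigned int flags;
};

/* Default (flag clear) errs on the side of caution: callers must assume
 * the frequency transition might sleep. */
static bool cpufreq_driver_might_sleep(const struct cpufreq_driver_sketch *drv)
{
    return !(drv->flags & CPUFREQ_DRIVER_WILL_NOT_SLEEP);
}
```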
2015-07-27sched: Prevent unnecessary active balance of single task in sched groupmorten.rasmussen@arm.com
Scenarios with the busiest group having just one task and the local group being idle on topologies with sched_groups with different numbers of cpus manage to dodge all load-balance bailout conditions, resulting in the nr_balance_failed counter being incremented. This eventually causes a pointless active migration of the task. This patch prevents this by not incrementing the counter when the busiest group only has one task. ASYM_PACKING migrations and migrations due to reduced capacity should still take place as these are explicitly captured by need_active_balance(). A better solution would be to not attempt the load-balance in the first place, but that requires significant changes to the order of bailout conditions and statistics gathering. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Disable energy-unfriendly nohz kicksmorten.rasmussen@arm.com
With energy-aware scheduling enabled nohz_kick_needed() generates many nohz idle-balance kicks which lead to nothing when multiple tasks get packed on a single cpu to save energy. This causes unnecessary wake-ups and hence wastes energy. Make these conditions depend on !energy_aware() for now until the energy-aware nohz story gets sorted out. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Enable idle balance to pull single task towards cpu with higher capacityDietmar Eggemann
We do not want to miss out on the ability to pull a single remaining task from a potential source cpu towards an idle destination cpu if the energy aware system operates above the tipping point. Add an extra criteria to need_active_balance() to kick off active load balance if the source cpu is over-utilized and has lower capacity than the destination cpu. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2015-07-27sched: Consider a not over-utilized energy-aware system as balancedDietmar Eggemann
In case the system operates below the tipping point indicator, introduced in ("sched: Add over-utilization/tipping point indicator"), bail out in find_busiest_group after the dst and src group statistics have been checked. There is simply no need to move usage around because all involved cpus still have spare cycles available. For an energy-aware system below its tipping point, we rely on the task placement of the wakeup path. This works well for short running tasks. The existence of long running tasks on one of the involved cpus lets the system operate over its tipping point. To be able to move such a task (whose load can't be used to average the load among the cpus) from a src cpu with lower capacity than the dst_cpu, an additional rule has to be implemented in need_active_balance. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2015-07-27sched: Energy-aware wake-up task placementmorten.rasmussen@arm.com
Let available compute capacity and estimated energy impact select the wake-up target cpu when energy-aware scheduling is enabled and the system is not over-utilized (above the tipping point). energy_aware_wake_cpu() attempts to find a group of cpus with sufficient compute capacity to accommodate the task, and then a cpu with enough spare capacity to handle the task within that group. Preference is given to cpus with enough spare capacity at the current OPP. Finally, the energy impact of the new target and the previous task cpu is compared to select the wake-up target cpu. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Consider spare cpu capacity at task wake-upmorten.rasmussen@arm.com
In mainline, find_idlest_group() selects the wake-up target group purely based on group load, which leads to suboptimal choices in low load scenarios. An idle group with reduced capacity (due to RT tasks or different cpu type) isn't necessarily a better target than a lightly loaded group with higher capacity. The patch adds spare capacity as an additional group selection parameter. The target group is now selected based on the following criteria, listed by highest priority first:

1. If energy-aware scheduling is enabled, the group with the lowest capacity containing a cpu with enough spare capacity to accommodate the task (with a bit to spare) is selected if such exists.

2. Return the group with the cpu with most spare capacity if that spare capacity is significant. Significant spare capacity is currently at least 20% to spare.

3. Return the group with the lowest load, unless it is the local group, in which case NULL is returned and the search is continued at the next (lower) level.

cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
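The three-rule priority can be sketched as a selection over per-group summaries (a hedged model: the struct, threshold constant, and function names are assumptions, not the patch's code, which derives these statistics while iterating cpus):

```c
#define SCHED_CAPACITY_SCALE 1024UL

/* Rule 2 threshold: spare capacity counts as "significant" at >= 20%. */
static int spare_is_significant(unsigned long spare)
{
    return spare >= SCHED_CAPACITY_SCALE / 5;
}

/* Per-group summary for the purpose of this sketch. */
struct group_stats {
    unsigned long min_capacity; /* lowest cpu capacity in the group */
    unsigned long max_spare;    /* largest spare capacity on any cpu */
    unsigned long avg_load;
};

/* Index of the chosen group per the priority list above, -1 if n == 0. */
static int pick_group(const struct group_stats *g, int n,
                      unsigned long task_util, int energy_aware)
{
    int best = -1, i;

    if (energy_aware) { /* rule 1: lowest-capacity group that fits */
        for (i = 0; i < n; i++)
            if (g[i].max_spare > task_util &&
                (best < 0 || g[i].min_capacity < g[best].min_capacity))
                best = i;
        if (best >= 0)
            return best;
    }
    for (i = 0; i < n; i++) /* rule 2: most (significant) spare capacity */
        if (spare_is_significant(g[i].max_spare) &&
            (best < 0 || g[i].max_spare > g[best].max_spare))
            best = i;
    if (best >= 0)
        return best;
    for (i = 0; i < n; i++) /* rule 3: lowest load */
        if (best < 0 || g[i].avg_load < g[best].avg_load)
            best = i;
    return best;
}
```

Note how energy-aware mode prefers the lower-capacity group even when a higher-capacity group has more spare room.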
2015-07-27sched: Add cpu capacity awareness to wakeup balancingmorten.rasmussen@arm.com
Wakeup balancing is completely unaware of cpu capacity, cpu usage and task utilization. The task is preferably placed on a cpu which is idle in the instant the wakeup happens. New tasks (SD_BALANCE_{FORK,EXEC}) are placed on an idle cpu in the idlest group if such can be found, otherwise they go on the least loaded one. Existing tasks (SD_BALANCE_WAKE) are placed on the previous cpu or an idle cpu sharing the same last level cache. Hence existing tasks don't get a chance to migrate to a different group at wakeup in case the current one has reduced cpu capacity (due to RT/IRQ pressure or a different uarch, e.g. ARM big.LITTLE). They may eventually get pulled by other cpus doing periodic/idle/nohz_idle balance, but it may take quite a while before it happens. This patch adds capacity awareness to find_idlest_{group,queue} (used by SD_BALANCE_{FORK,EXEC}) such that groups/cpus that can accommodate the waking task based on task utilization are preferred. In addition, wakeup of existing tasks (SD_BALANCE_WAKE) is sent through find_idlest_{group,queue} if the task doesn't fit the capacity of the previous cpu to allow it to escape (override wake_affine) when necessary instead of relying on periodic/idle/nohz_idle balance to eventually sort it out. The patch doesn't depend on any energy model infrastructure, but it is kept behind the energy_aware() static key despite being primarily a performance optimization as it may increase scheduler overhead slightly. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Determine the current sched_group idle-stateDietmar Eggemann
To estimate the energy consumption of a sched_group in sched_group_energy() it is necessary to know which idle-state the group is in when it is idle. For now, it is assumed that this is the current idle-state (though it might be wrong). Based on the individual cpu idle-states group_idle_state() finds the group idle-state. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2015-07-27sched: Count number of shallower idle-states in struct sched_group_energymorten.rasmussen@arm.com
cpuidle associates all idle-states with each cpu while the energy model associates them with the sched_group covering the cpus coordinating entry to the idle-state. To look up the idle-state power consumption in the energy model it is therefore necessary to translate from cpuidle idle-state index to energy model index. For this purpose it is helpful to know how many idle-states are listed in lower level sched_groups (in struct sched_group_energy).

Example: ARMv8 big.LITTLE JUNO (Cortex A57, A53) idle-states:

                       cpuidle   Energy model table indices
  Idle-state           index     per-cpu sg   per-cluster sg
  WFI                  0         0            (0)
  Core power-down      1         1            0*
  Cluster power-down   2         (1)          1

For per-cpu sgs no translation is required. If cpuidle reports state index 0 or 1, the cpu is in WFI or core power-down, respectively. We can look the idle-power up directly in the sg energy model table. The idle-state cluster power-down is represented in the per-cluster sg energy model table as index 1. Index 0* is reserved for cluster power consumption when the cpus are all in state 0 or 1 but cpuidle decided not to go for cluster power-down. Given the index from cpuidle we can compute the correct index in the energy model tables for the sgs at each level if we know how many states are in the tables in the child sgs. The actual translation is implemented in a later patch. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
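The translation itself lands in a later patch; one plausible sketch consistent with the JUNO table (an assumption, not the merged code):

```c
/* Group-level table index for cpuidle index `idx`, given the number of
 * idle-states listed in the child sched_groups' tables. For per-cpu
 * groups no translation is needed (use `idx` directly). */
static int em_idle_index(int idx, int states_in_child_sgs)
{
    int i = idx - states_in_child_sgs + 1;
    return i > 0 ? i : 0; /* shallower states map to reserved index 0 */
}
```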
2015-07-27sched, cpuidle: Track cpuidle state index in the schedulermorten.rasmussen@arm.com
The idle-state of each cpu is currently pointed to by rq->idle_state but there isn't any information in struct cpuidle_state that can be used to look up the idle-state energy model data stored in struct sched_group_energy. For this purpose it is necessary to store the idle-state index as well. Ideally, the idle-state data should be unified. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Store system-wide maximum cpu capacity in root domainDietmar Eggemann
To be able to compare the capacity of the target cpu with the highest cpu capacity of the system in the wakeup path, store the system-wide maximum cpu capacity in the root domain. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2015-07-27sched: Add over-utilization/tipping point indicatormorten.rasmussen@arm.com
Energy-aware scheduling is only meant to be active while the system is _not_ over-utilized. That is, there are spare cycles available to shift tasks around based on their actual utilization to get a more energy-efficient task distribution without depriving any tasks. When above the tipping point, task placement is done the traditional way, spreading the tasks across as many cpus as possible based on priority scaled load to preserve smp_nice. The over-utilization condition is conservatively chosen to indicate over-utilization as soon as one cpu is fully utilized at its highest frequency. We don't consider groups, as lumping usage and capacity together for a group of cpus may hide the fact that one or more cpus in the group are over-utilized while group-siblings are partially idle. The tasks could be served better if moved to another group with completely idle cpus. This is particularly problematic if some cpus have a significantly reduced capacity due to RT/IRQ pressure or if the system has cpus of different capacity (e.g. ARM big.LITTLE). cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
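The per-cpu condition can be sketched as a fixed-point comparison (the margin value is an assumption; the shape of the check, per-cpu and never lumped per-group, follows the text):

```c
#define SCHED_CAPACITY_SCALE 1024UL
/* ~25% headroom before a cpu counts as over-utilized; the exact margin
 * is an assumption for this sketch. */
#define CAPACITY_MARGIN 1280UL

/* usage and capacity_orig both on the 0..1024 capacity scale, with
 * capacity_orig being the cpu's capacity at its highest frequency. */
static int cpu_overutilized(unsigned long usage, unsigned long capacity_orig)
{
    return capacity_orig * SCHED_CAPACITY_SCALE < usage * CAPACITY_MARGIN;
}
```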
2015-07-27sched: Estimate energy impact of scheduling decisionsmorten.rasmussen@arm.com
Adds a generic energy-aware helper function, energy_diff(), that calculates energy impact of adding, removing, and migrating utilization in the system. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Extend sched_group_energy to test load-balancing decisionsmorten.rasmussen@arm.com
Extended sched_group_energy() to support energy prediction with usage (tasks) added/removed from a specific cpu or migrated between a pair of cpus. Useful for load-balancing decision making. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Calculate energy consumption of sched_groupmorten.rasmussen@arm.com
For energy-aware load-balancing decisions it is necessary to know the energy consumption estimates of groups of cpus. This patch introduces a basic function, sched_group_energy(), which estimates the energy consumption of the cpus in the group and any resources shared by the members of the group. NOTE: The function has five levels of indentation and breaks the 80 character limit. Refactoring is necessary. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
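At its core the estimate weighs busy power by the group's normalized utilization and charges idle power for the remainder. A heavily simplified sketch (the real function additionally walks the sd hierarchy and shared resources; names and the single-level form are assumptions):

```c
#define SCHED_CAPACITY_SCALE 1024UL

/* norm_util: group utilization normalized to 0..1024.
 * busy_power/idle_power: energy model values for the group's current
 * capacity state and idle state. */
static unsigned long group_energy(unsigned long norm_util,
                                  unsigned long busy_power,
                                  unsigned long idle_power)
{
    return (norm_util * busy_power +
            (SCHED_CAPACITY_SCALE - norm_util) * idle_power)
           / SCHED_CAPACITY_SCALE;
}
```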
2015-07-27sched: Highest energy aware balancing sched_domain level pointermorten.rasmussen@arm.com
Add another member to the family of per-cpu sched_domain shortcut pointers. This one, sd_ea, points to the highest level at which an energy model is provided. At this level and all levels below, all sched_groups have energy model data attached. Partial energy model information is possible but restricted to providing energy model data for lower level sched_domains (sd_ea and below) and leaving load-balancing on levels above to non-energy-aware load-balancing. For example, it is possible to apply energy-aware scheduling within each socket on a multi-socket system and let normal scheduling handle load-balancing between sockets. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Relocated get_cpu_usage() and change return typemorten.rasmussen@arm.com
Move get_cpu_usage() to an earlier position in fair.c and change return type to unsigned long as negative usage doesn't make much sense. All other load and capacity related functions use unsigned long including the caller of get_cpu_usage(). cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Compute cpu capacity available at current frequencymorten.rasmussen@arm.com
capacity_orig_of() returns the max available compute capacity of a cpu. For scale-invariant utilization tracking and energy-aware scheduling decisions it is useful to know the compute capacity available at the current OPP of a cpu. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
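The current-OPP capacity is the max capacity scaled by the current-to-max frequency ratio; a sketch (the helper name is an assumption, the arithmetic follows the fixed-point convention used throughout this series):

```c
#define SCHED_CAPACITY_SHIFT 10

/* capacity_orig: max capacity of the cpu (0..1024).
 * freq_scale: current/max frequency ratio on the 0..1024 scale, as an
 * arch_scale_freq_capacity()-style value. */
static unsigned long capacity_curr(unsigned long capacity_orig,
                                   unsigned long freq_scale)
{
    return (capacity_orig * freq_scale) >> SCHED_CAPACITY_SHIFT;
}
```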
2015-07-27arm: topology: Define TC2 energy and provide it to the schedulerDietmar Eggemann
This patch is only here to be able to test provisioning of energy related data from an arch topology shim layer to the scheduler. Since there is no code today which deals with extracting energy related data from the dtb or acpi and processing it in the topology shim layer, the contents of the sched_group_energy structures as well as the idle_state and capacity_state arrays are hard-coded here. This patch defines the sched_group_energy structure as well as the idle_state and capacity_state array for the cluster (relates to sched groups (sgs) in DIE sched domain level) and for the core (relates to sgs in MC sd level) for a Cortex A7 as well as for a Cortex A15. It further provides related implementations of the sched_domain_energy_f functions (cpu_cluster_energy() and cpu_core_energy()). To be able to propagate this information from the topology shim layer to the scheduler, the elements of the arm_topology[] table have been provisioned with the appropriate sched_domain_energy_f functions. cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2015-07-27sched: Introduce SD_SHARE_CAP_STATES sched_domain flagmorten.rasmussen@arm.com
cpufreq is currently keeping it a secret which cpus are sharing a clock source. The scheduler needs to know about clock domains as well to become more energy aware. The SD_SHARE_CAP_STATES domain flag indicates whether cpus belonging to the sched_domain share capacity states (P-states). There is no connection with cpufreq (yet). The flag must be set by the arch specific topology code. cc: Russell King <linux@arm.linux.org.uk> cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Allocate and initialize energy data structuresDietmar Eggemann
The per sched group sched_group_energy structure plus the related idle_state and capacity_state arrays are allocated like the other sched domain (sd) hierarchy data structures. This includes the freeing of sched_group_energy structures which are not used. Energy-aware scheduling allows that a system only has energy model data up to a certain sd level (so called highest energy aware balancing sd level). A check in init_sched_energy enforces that all sd's below this sd level contain energy model data. One problem is that the number of elements of the idle_state and the capacity_state arrays is not fixed and has to be retrieved in __sdt_alloc() to allocate memory for the sched_group_energy structure and the two arrays in one chunk. The array pointers (idle_states and cap_states) are initialized here to point to the correct place inside the memory chunk. The new function init_sched_energy() initializes the sched_group_energy structure and the two arrays in case the sd topology level contains energy information. This patch has been tested with scheduler feature flag FORCE_SD_OVERLAP enabled as well. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2015-07-27sched: Introduce energy data structuresDietmar Eggemann
The struct sched_group_energy represents the per sched_group related data which is needed for energy aware scheduling. It contains:

(1) an atomic reference counter for scheduler internal bookkeeping of data allocation and freeing
(2) the number of elements of the idle state array
(3) a pointer to the idle state array which comprises 'power consumption' for each idle state
(4) the number of elements of the capacity state array
(5) a pointer to the capacity state array which comprises 'compute capacity and power consumption' tuples for each capacity state

Allocation and freeing of struct sched_group_energy utilizes the existing infrastructure of the scheduler which is currently used for the other sd hierarchy data structures (e.g. struct sched_domain) as well. That's why struct sd_data is provisioned with a per-cpu struct sched_group_energy double pointer. The struct sched_group obtains a pointer to a struct sched_group_energy. The function pointer sched_domain_energy_f is introduced into struct sched_domain_topology_level which will allow the arch to pass a particular struct sched_group_energy from the topology shim layer into the scheduler core. The function pointer sched_domain_energy_f has an 'int cpu' parameter since the folding of two adjacent sd levels via sd degenerate doesn't work for all sd levels. I.e. it is not possible for example to use this feature to provide per-cpu energy in sd level DIE on ARM's TC2 platform. It was discussed that the folding of sd levels approach is preferable over the cpu parameter approach, simply because the user (the arch specifying the sd topology table) can introduce fewer errors. But since it is not working, the 'int cpu' parameter is the only way out. It's possible to use the folding of sd levels approach for sched_domain_flags_f and the cpu parameter approach for sched_domain_energy_f at the same time though.
With the use of the 'int cpu' parameter, an extra check function has to be provided to make sure that all cpus spanned by a sched group are provisioned with the same energy data. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
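The five fields map onto a struct sketch like the following (field types and exact layout are assumptions based on the description; atomic_int stands in for the kernel's atomic_t):

```c
#include <stdatomic.h> /* userspace stand-in for the kernel's atomic_t */

struct idle_state {
    unsigned long power;      /* power consumption in this idle state */
};

struct capacity_state {
    unsigned long cap;        /* compute capacity */
    unsigned long power;      /* power consumption at this P-state */
};

struct sched_group_energy {
    atomic_int ref;                       /* (1) alloc/free bookkeeping */
    int nr_idle_states;                   /* (2) */
    struct idle_state *idle_states;       /* (3) */
    int nr_cap_states;                    /* (4) */
    struct capacity_state *cap_states;    /* (5) */
};

/* Provider hook in the topology table; note the 'int cpu' parameter
 * discussed above. */
typedef const struct sched_group_energy *(*sched_domain_energy_f)(int cpu);
```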
2015-07-27sched: Make energy awareness a sched featuremorten.rasmussen@arm.com
This patch introduces the ENERGY_AWARE sched feature, which is implemented using jump labels when SCHED_DEBUG is defined. It is statically set to false when SCHED_DEBUG is not defined. Hence this doesn't allow energy awareness to be enabled without SCHED_DEBUG. This sched_feature knob will be replaced later with a more appropriate control knob when things have matured a bit. ENERGY_AWARE is based on per-entity load-tracking hence FAIR_GROUP_SCHED must be enabled. This dependency isn't checked at compile time yet. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Documentation for scheduler energy cost modelmorten.rasmussen@arm.com
This documentation patch provides an overview of the experimental scheduler energy costing model, associated data structures, and a reference recipe on how platforms can be characterized to derive energy models. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Initialize CFS task load and usage before placing task on rqmorten.rasmussen@arm.com
Task load or usage is not currently considered in select_task_rq_fair(), but if we want that in the future we should make sure it is not zero for new tasks. The load-tracking sums are currently initialized using sched_slice(), which won't work before the task has been assigned a rq. Initialization is therefore changed to another semi-arbitrary value, sched_latency, instead. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Remove blocked load and utilization contributions of dying tasksmorten.rasmussen@arm.com
Tasks being dequeued for the last time (state == TASK_DEAD) are dequeued with the DEQUEUE_SLEEP flag which causes their load and utilization contributions to be added to the runqueue blocked load and utilization. Hence the blocked sums will contain load or utilization that has gone away. The issue only exists for the root cfs_rq as cgroup_exit() doesn't set DEQUEUE_SLEEP for task group exits. If runnable+blocked load is to be used as a better estimate for cpu load the dead task contributions need to be removed to prevent load_balance() (idle_balance() in particular) from over-estimating the cpu load. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Include blocked utilization in usage trackingmorten.rasmussen@arm.com
Add the blocked utilization contribution to group sched_entity utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage(). With this change cpu usage now includes recent usage by currently non-runnable tasks, hence it provides a more stable view of the cpu usage. It does, however, also mean that the meaning of usage is changed: A cpu may be momentarily idle while usage is >0. It can no longer be assumed that cpu usage >0 implies runnable tasks on the rq. cfs_rq->utilization_load_avg or nr_running should be used instead to get the current rq status. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Track blocked utilization contributionsmorten.rasmussen@arm.com
Introduces the blocked utilization, the utilization counter-part to cfs_rq->blocked_load_avg. It is the sum of sched_entity utilization contributions of entities that were recently on the cfs_rq and are currently blocked. Combined with the sum of utilization of entities currently on the cfs_rq or currently running (cfs_rq->utilization_load_avg) this provides a more stable average view of the cpu usage. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Get rid of scaling usage by cpu_capacity_origDietmar Eggemann
Since cfs_rq::utilization_load_avg is now both frequency invariant and cpu (uarch plus max system frequency) invariant, frequency and cpu scaling both happen as part of the load tracking. So cfs_rq::utilization_load_avg does not have to be scaled by the original capacity of the cpu again. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2015-07-27arm: Cpu invariant scheduler load-tracking supportDietmar Eggemann
Reuses the existing infrastructure for cpu_scale to provide the scheduler with a cpu scaling correction factor for more accurate load-tracking. This factor comprises a micro-architectural part, which is based on the cpu efficiency value of a cpu, as well as a platform-wide max frequency part, which relates to the dtb property clock-frequency of a cpu node. The calculation of cpu_scale, the return value of arch_scale_cpu_capacity(), changes from

  capacity / middle_capacity, with capacity = (clock_frequency >> 20) * cpu_efficiency

to

  SCHED_CAPACITY_SCALE * cpu_perf / max_cpu_perf

The range of the cpu_scale value changes from [0..3*SCHED_CAPACITY_SCALE/2] to [0..SCHED_CAPACITY_SCALE]. The functionality to calculate the middle_capacity, which corresponds to an 'average' cpu, has been taken out since the scaling is now done differently. In the case that either the cpu efficiency or the clock-frequency value for a cpu is missing, no cpu scaling is done for any cpu. The platform-wide max frequency part of the factor should not be confused with the frequency invariant scheduler load-tracking support which deals with frequency related scaling due to DVFS functionality on a cpu. Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
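The new formula can be sketched directly (the efficiency values below come from the existing arm topology efficiency table; the clock frequencies are illustrative, not TC2's actual values):

```c
#define SCHED_CAPACITY_SCALE 1024UL

/* cpu_perf per cpu node, from dtb properties:
 *   cpu_perf = (clock_frequency >> 20) * cpu_efficiency */
static unsigned long cpu_perf(unsigned long clock_frequency_hz,
                              unsigned long cpu_efficiency)
{
    return (clock_frequency_hz >> 20) * cpu_efficiency;
}

/* cpu_scale = SCHED_CAPACITY_SCALE * cpu_perf / max_cpu_perf;
 * widen the multiply so it cannot overflow on 32-bit. */
static unsigned long cpu_scale(unsigned long perf, unsigned long max_perf)
{
    return (unsigned long)((unsigned long long)SCHED_CAPACITY_SCALE
                           * perf / max_perf);
}
```

The fastest cpu lands at SCHED_CAPACITY_SCALE and every other cpu at a proportionally smaller value.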
2015-07-27sched: Make usage tracking cpu scale-invariantDietmar Eggemann
Besides the existing frequency scale-invariance correction factor, apply a cpu scale-invariance correction factor to usage tracking. Cpu scale-invariance takes cpu performance deviations due to micro-architectural differences (i.e. instructions per second) between cpus in HMP systems (e.g. big.LITTLE) and differences in the frequency value of the highest OPP between cpus in SMP systems into consideration. Each segment of the sched_avg::running_avg_sum geometric series is now scaled by the cpu performance factor too, so the sched_avg::utilization_avg_contrib of each entity will be invariant from the particular cpu of the HMP/SMP system it is gathered on. So the usage level that is returned by get_cpu_usage() stays relative to the max cpu performance of the system. In contrast to usage, load (sched_avg::runnable_avg_sum) is currently not made cpu scale-invariant because this would have a negative effect on the existing load balance code based on s[dg]_lb_stats::avg_load in overload scenarios.

Example: 7 always-running tasks, 4 on cluster 0 (2 cpus w/ cpu_capacity=512), 3 on cluster 1 (1 cpu w/ cpu_capacity=1024)

                    cluster 0      cluster 1
  capacity          1024 (2*512)   1024 (1*1024)
  load              4096           3072
  cpu-scaled load   2048           3072

Simply using cpu-scaled load in the existing lb code would declare cluster 1 busier than cluster 0, although the compute capacity budget for one task is higher on cluster 1 (1024/3 = 341) than on cluster 0 (2*512/4 = 256). Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
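The example's arithmetic can be reproduced with a tiny model (names and the struct are illustrative; the numbers are the commit's own):

```c
/* N always-running tasks on a cluster of `ncpus` cpus of capacity `cap`. */
struct cluster { unsigned long ncpus, cap, ntasks; };

/* Compute capacity budget available to each task on the cluster. */
static unsigned long budget_per_task(struct cluster c)
{
    return c.ncpus * c.cap / c.ntasks;
}

/* Cpu-scaled load: each always-running task's load is scaled down from
 * 1024 to the capacity of the cpu it runs on. */
static unsigned long cpu_scaled_load(struct cluster c)
{
    return c.ntasks * c.cap;
}
```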
2015-07-27arm: Update arch_scale_cpu_capacity() to reflect change to definemorten.rasmussen@arm.com
arch_scale_cpu_capacity() is no longer a weak function but a #define instead. Include the #define in topology.h. cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27sched: Convert arch_scale_cpu_capacity() from weak function to #definemorten.rasmussen@arm.com
Bring arch_scale_cpu_capacity() in line with the recent change of its arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched: Optimize freq invariant accounting") from weak function to #define to allow inlining of the function. While at it, remove the ARCH_CAPACITY sched_feature as well. With the change to #define there isn't a straightforward way to allow runtime switch between an arch implementation and the default implementation of arch_scale_cpu_capacity() using sched_feature. The default was to use the arch-specific implementation, but only the arm architecture provided one. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2015-07-27arm: vexpress: Add CPU clock-frequencies to TC2 device-treeDietmar Eggemann
To enable the parsing of clock frequency and cpu efficiency values inside parse_dt_topology [arch/arm/kernel/topology.c] to scale the relative capacity of the cpus, this property has to be provided within the cpu nodes of the dts file. The patch is a copy of commit 8f15973ef8c3 ("ARM: vexpress: Add CPU clock-frequencies to TC2 device-tree") taken from Linaro Stable Kernel (LSK) massaged into mainline. Cc: Jon Medhurst <tixy@linaro.org> Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2015-07-27sched: Make load tracking frequency scale-invariantDietmar Eggemann
Apply frequency scale-invariance correction factor to load tracking. Each segment of the sched_avg::runnable_avg_sum geometric series is now scaled by the current frequency so the sched_avg::load_avg_contrib of each entity will be invariant with frequency scaling. As a result, cfs_rq::runnable_load_avg, which is the sum of sched_avg::load_avg_contrib, becomes invariant too. So the load level that is returned by weighted_cpuload stays relative to the max frequency of the cpu. Then, we want to keep the load-tracking values in a 32-bit type, which implies that the max value of sched_avg::{runnable|running}_avg_sum must be lower than 2^32/88761=48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742, arch_scale_freq_capacity must return a value less than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY = 1024). So we define the range to [0..SCHED_SCALE_CAPACITY] in order to avoid overflow. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
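The overflow bound can be checked numerically (constants taken from the text; function names are illustrative):

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define MAX_TASK_WEIGHT 88761 /* max weight of a task */
#define LOAD_AVG_MAX    47742

/* Largest {runnable|running}_avg_sum keeping weight * sum in 32 bits. */
static uint32_t max_running_sum(void)
{
    return (uint32_t)((UINT64_C(1) << 32) / MAX_TASK_WEIGHT);
}

/* Upper bound on what arch_scale_freq_capacity() may return. */
static uint32_t max_freq_capacity(void)
{
    return (uint32_t)(((uint64_t)max_running_sum() << SCHED_CAPACITY_SHIFT)
                      / LOAD_AVG_MAX);
}
```

Since 1024 < 1037, capping the range at [0..1024] leaves headroom against 32-bit overflow.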
2015-07-27arm: Frequency invariant scheduler load-tracking supportMorten Rasmussen
Implements an arch-specific function to provide the scheduler with a frequency scaling correction factor for more accurate load-tracking. The factor is: current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu) This implementation only provides frequency invariance. No cpu invariance yet. Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
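The stated factor in code (the function name is illustrative; the arch hook reads the per-cpu frequencies itself):

```c
#define SCHED_CAPACITY_SHIFT 10

/* current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu), both in the
 * same unit (kHz here), yielding a 0..1024 ratio. */
static unsigned long freq_scale_factor(unsigned long curr_khz,
                                       unsigned long max_khz)
{
    return (curr_khz << SCHED_CAPACITY_SHIFT) / max_khz;
}
```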
2015-07-27arm64: dts: mediatek: Add MT8173 MMC dtsEddie Huang
Add nodes mmc0 ~ mmc3 to mt8173.dtsi and nodes mmc0, mmc1 to mt8173-evb.dts. Change-Id: I150e7bbe9d9c110135eef1795bc0719934f2c098 Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com> Signed-off-by: Eddie Huang <eddie.huang@mediatek.com>
2015-07-27arm64: dts: mt8173: Add mt8173 cpufreq driver supportPi-Cheng Chen
This patch adds the required properties to the device tree to enable the MT8173 cpufreq driver. Signed-off-by: Pi-Cheng Chen <pi-cheng.chen@linaro.org>
2015-07-27cpufreq: mediatek: Add MT8173 cpufreq driverPi-Cheng Chen
Mediatek MT8173 is an ARMv8-based quad-core (2*Cortex-A53 and 2*Cortex-A72) SoC with dual clusters. For each cluster, two voltage inputs, Vproc and Vsram, are supplied by two regulators. For the big cluster, the two regulators come from different PMICs. In this case, when scaling the voltage inputs of the cluster, the voltages of the two regulator inputs need to be controlled by software explicitly, under the SoC-specific limitation: 100mV < Vsram - Vproc < 200mV. This is called the 'voltage tracking' mechanism. And when scaling the frequency of the cluster clock input, the input MUX clock needs to be parented to another "intermediate" stable PLL first, and reparented back to the original PLL once the original PLL is stable at the target frequency. This patch implements those mechanisms to enable CPU DVFS support for the Mediatek MT8173 SoC. Signed-off-by: Pi-Cheng Chen <pi-cheng.chen@linaro.org>
2015-07-27arm64: dts: mediatek: add xHCI & usb phy for mt8173Chunfeng Yun
Change-Id: I2c12414879b827346c0329f3a5189299b764811b Signed-off-by: Chunfeng Yun <chunfeng.yun@mediatek.com>
2015-07-27xhci: mediatek: support MTK xHCI host controllerChunfeng Yun
The MTK xHCI host controller defines some extra SW scheduling parameters for HW to minimize the scheduling effort for synchronous and interrupt endpoints. The parameters are put into reserved DWs of the slot context and endpoint context. Change-Id: Ic7644f98e1d747a8e863dd3fb3c99cbcb728cde6 Signed-off-by: Chunfeng Yun <chunfeng.yun@mediatek.com>
2015-07-27usb: phy: add usb3.0 phy driver for mt65xx SoCsChunfeng Yun
Change-Id: I69c12fa9e9c716249dc8f35f0959425ead1f4265 Signed-off-by: Chunfeng Yun <chunfeng.yun@mediatek.com>
2015-07-27dt-bindings: Add a binding for Mediatek xHCI host controllerChunfeng Yun
Add DT binding documentation for the xHCI host controller of the MT8173 SoC from Mediatek. Change-Id: I0ac0884c4bc6e3783c4f9cc028c061b019dd2825 Signed-off-by: Chunfeng Yun <chunfeng.yun@mediatek.com>
2015-07-27dt-bindings: Add usb3.0 phy binding for MT65xx SoCsChunfeng Yun
Add DT binding documentation for the usb3.0 phy of MT65xx SoCs from Mediatek. Change-Id: Ia8099589ac3953bfa933a336048e4889450a1360 Signed-off-by: Chunfeng Yun <chunfeng.yun@mediatek.com>
2015-07-27clk: mediatek: Export CPU mux clocks for CPU frequency controlpi-cheng.chen
This patch adds CPU mux clocks which are used by the Mediatek cpufreq driver for intermediate clock source switching. Signed-off-by: Pi-Cheng Chen <pi-cheng.chen@linaro.org> Reviewed-by: Daniel Kurtz <djkurtz@chromium.org>
2015-07-27clk: mediatek: Add USB clock support in MT8173 APMIXEDSYSJames Liao
Add REF2USB_TX clock support into MT8173 APMIXEDSYS. This clock is needed by USB 3.0. Change-Id: I6c94fa8ce5f35dad7a1cda6dbf403b3bd6b81905 Signed-off-by: James Liao <jamesjj.liao@mediatek.com>
2015-07-27clk: mediatek: Add subsystem clocks of MT8173James Liao
Most multimedia subsystem clocks will be accessed by multiple drivers, so it is better to manage these clocks in the common clock framework (CCF). This patch adds clock support for the MM, IMG, VDEC, VENC and VENC_LT subsystems. Change-Id: I9cc9f8a066806d3c2d0769062182dad419499628 Signed-off-by: James Liao <jamesjj.liao@mediatek.com>
2015-07-27dt-bindings: ARM: Mediatek: Document devicetree bindings for clock controllersJames Liao
This adds the binding documentation for the mmsys, imgsys, vdecsys, vencsys and vencltsys controllers found on Mediatek SoCs. Change-Id: Id583b6030541292fafeaa1ede7d45ad7bf540226 Signed-off-by: James Liao <jamesjj.liao@mediatek.com>
2015-07-27clk: mediatek: mt8173: Fix enabling of critical clocksSascha Hauer
On the MT8173 the clocks are provided by different units. To enable the critical clocks we must be sure that all parent clocks are already registered, otherwise the parents of the critical clocks end up being unused and get disabled later. To find a point at which all parents are registered, check after each batch of clocks is registered whether all known providers are now present, and only then enable the critical clocks. Change-Id: Iee564e380522380ddd06b109e3fcc8f0b6ec7c52 Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: James Liao <jamesjj.liao@mediatek.com> Reviewed-by: Daniel Kurtz <djkurtz@chromium.org>