- Nov 23, 2017
-
Dietmar Eggemann authored
The trace event key load is mapped to: (1) load : cfs_rq->tg->load_avg The cfs_rq owned by the task_group is used as the only parameter for the trace event because it has a reference to the taskgroup and the cpu. Using the taskgroup as a parameter instead would require the cpu as a second parameter. A task_group is global and not per-cpu data. The cpu key only tells on which cpu the value was gathered. The following list shows examples of the key=value pairs for: (1) a task group: cpu=1 path=/tg1/tg11/tg111 load=517 (2) an autogroup: cpu=1 path=/autogroup-10 load=1050 We don't maintain a load signal for a root task group. The trace event is only defined if cfs group scheduling support (CONFIG_FAIR_GROUP_SCHED) is enabled. Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
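A minimal sketch of how such a trace event could be declared with the standard tracepoint macros. The event name, the field layout and the __trace_sched_*() helper signatures below are assumptions based on the description above, not the patch's actual code:

#ifdef CONFIG_FAIR_GROUP_SCHED
/*
 * Hypothetical sketch only: event name, fields and helper signatures
 * are illustrative.
 */
TRACE_EVENT(sched_load_tg,

        TP_PROTO(struct cfs_rq *cfs_rq),

        TP_ARGS(cfs_rq),

        TP_STRUCT__entry(
                __field(int,  cpu)
                __dynamic_array(char, path,
                                __trace_sched_path(cfs_rq, NULL, 0))
                __field(long, load)
        ),

        TP_fast_assign(
                __entry->cpu  = __trace_sched_cpu(cfs_rq);
                __trace_sched_path(cfs_rq, __get_dynamic_array(path),
                                   __get_dynamic_array_len(path));
                __entry->load = atomic_long_read(&cfs_rq->tg->load_avg);
        ),

        TP_printk("cpu=%d path=%s load=%ld",
                  __entry->cpu, __get_str(path), __entry->load)
);
#endif /* CONFIG_FAIR_GROUP_SCHED */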
-
Dietmar Eggemann authored
The following trace event keys are mapped to: (1) load : se->avg.load_avg (2) rbl_load : se->avg.runnable_load_avg (3) util : se->avg.util_avg To let this trace event work for configurations w/ and w/o group scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following special handling is necessary for non-existent key=value pairs: path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED or the sched_entity represents a task. comm = "(null)" : In case sched_entity represents a task_group. pid = -1 : In case sched_entity represents a task_group. The following list shows examples of the key=value pairs in different configurations for: (1) a task: cpu=0 path=(null) comm=sshd pid=2206 load=102 rbl_load=102 util=102 (2) a taskgroup: cpu=1 path=/tg1/tg11/tg111 comm=(null) pid=-1 load=882 rbl_load=882 util=510 (3) an autogroup: cpu=0 path=/autogroup-13 comm=(null) pid=-1 load=49 rbl_load=49 util=48 (4) w/o CONFIG_FAIR_GROUP_SCHED: cpu=0 path=(null) comm=sshd pid=2211 load=301 rbl_load=301 util=265 The trace event is only defined for CONFIG_SMP. The helper functions __trace_sched_cpu(), __trace_sched_path() and __trace_sched_id() are extended to deal with sched_entities as well. Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org>
-
Dietmar Eggemann authored
The following trace event keys are mapped to: (1) load : cfs_rq->avg.load_avg (2) rbl_load : cfs_rq->avg.runnable_load_avg (3) util : cfs_rq->avg.util_avg To let this trace event work for configurations w/ and w/o group scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following special handling is necessary for a non-existent key=value pair: path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED. The following list shows examples of the key=value pairs in different configurations for: (1) a root task_group: cpu=4 path=/ load=6 rbl_load=6 util=331 (2) a task_group: cpu=1 path=/tg1/tg11/tg111 load=538 rbl_load=538 util=522 (3) an autogroup: cpu=3 path=/autogroup-18 load=997 rbl_load=997 util=517 (4) w/o CONFIG_FAIR_GROUP_SCHED: cpu=0 path=(null) load=314 rbl_load=314 util=289 The trace event is only defined for CONFIG_SMP. The helper function __trace_sched_path() can be used to get the length parameter of the dynamic array (path == NULL) and to copy the path into it (path != NULL). Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org>
-
Dietmar Eggemann authored
Define autogroup_path() even in the !CONFIG_SCHED_DEBUG case. If CONFIG_SCHED_AUTOGROUP is enabled the path of an autogroup has to be available to be printed in the load tracking trace events provided by this patch-stack regardless whether CONFIG_SCHED_DEBUG is set or not. Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org>
-
Dietmar Eggemann authored
This patch makes the energy data available via procfs. The related files are placed as sub-directory named 'energy' inside the /proc/sys/kernel/sched_domain/cpuX/domainY/groupZ directory for those cpu/domain/group tuples which have energy information. The following example depicts the contents of /proc/sys/kernel/sched_domain/cpu0/domain0/group[01] for a system which has energy information attached to domain level 0.

├── cpu0
│   ├── domain0
│   │   ├── busy_factor
│   │   ├── busy_idx
│   │   ├── cache_nice_tries
│   │   ├── flags
│   │   ├── forkexec_idx
│   │   ├── group0
│   │   │   └── energy
│   │   │       ├── cap_states
│   │   │       ├── idle_states
│   │   │       ├── nr_cap_states
│   │   │       └── nr_idle_states
│   │   ├── group1
│   │   │   └── energy
│   │   │       ├── cap_states
│   │   │       ├── idle_states
│   │   │       ├── nr_cap_states
│   │   │       └── nr_idle_states
│   │   ├── idle_idx
│   │   ├── imbalance_pct
│   │   ├── max_interval
│   │   ├── max_newidle_lb_cost
│   │   ├── min_interval
│   │   ├── name
│   │   ├── newidle_idx
│   │   └── wake_idx
│   └── domain1
│       ├── busy_factor
│       ├── busy_idx
│       ├── cache_nice_tries
│       ├── flags
│       ├── forkexec_idx
│       ├── idle_idx
│       ├── imbalance_pct
│       ├── max_interval
│       ├── max_newidle_lb_cost
│       ├── min_interval
│       ├── name
│       ├── newidle_idx
│       └── wake_idx

The files 'nr_idle_states' and 'nr_cap_states' contain a scalar value, whereas 'idle_states' contains a vector of power consumption values (one per idle state) and 'cap_states' contains a vector of (compute capacity, power consumption) tuples (one per capacity state). Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Dietmar Eggemann authored
To make sure that the capacity value of the last element of the capacity states vector of the energy model (EM) core (MC) level is equal to the cpu scale value, use this capacity value to overwrite the cpu scale value previously derived from the Cpu Invariant Engine (CIE). This patch is necessary as long as there is no complete EM support in device tree. Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
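A rough sketch of what this override could look like. cpu_core_energy() is the MC-level accessor mentioned elsewhere in this series; the cap_states[].cap member name, the helper name and the use of topology_set_cpu_scale() are assumptions for illustration:

/*
 * Illustrative sketch: overwrite the CIE-derived cpu scale with the
 * capacity of the highest capacity state of the MC-level energy model.
 */
static void update_cpu_scale_from_em(unsigned int cpu)
{
        const struct sched_group_energy *sge = cpu_core_energy(cpu);

        if (sge && sge->nr_cap_states)
                topology_set_cpu_scale(cpu,
                        sge->cap_states[sge->nr_cap_states - 1].cap);
}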
-
Dietmar Eggemann authored
Hi6220 has a single frequency domain spanning the two clusters. It needs the SYS sched domain (sd) to let the EAS algorithm work properly. The SD_SHARE_CAP_STATES flag is not set on the SYS sd. This lets sd_ea (highest sd w/ energy model data) point to the SYS sd whereas sd_scs (highest sd w/ SD_SHARE_CAP_STATES set) points to the DIE sd. This setup allows the code in sched_group_energy() to set sg_shared_cap to the single sched group of the SYS sd covering all the cpus in the system as they are all part of the single frequency domain. The capacity and idle state vectors only contain entries w/ power values equal to zero, so there is no system-wide energy contribution. Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Dietmar Eggemann authored
Allow the energy model to contain a system level besides the already existing core and cluster level. This is necessary for platforms with frequency domains spanning all cpus to let the EAS algorithm work properly. The whole idea of this system level has to be rethought once the idea of the 'struct sched_domain_shared' gets more momentum: https://lkml.org/lkml/2016/6/16/209 Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Dietmar Eggemann authored
Hikey is an SMP platform, so this property would normally not be necessary. But since we drive the setting of the EAS specific sched domain flag SD_SHARE_CAP_STATES via the init_cpu_capacity_callback() cpufreq notifier, we have to make sure that cap_parsing_failed is not set to true in parse_cpu_capacity() so that init_cpu_capacity_callback() will not bail out before consuming the CPUFREQ_NOTIFY. The easiest way to achieve this is to provide the dts file with this property. Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Dietmar Eggemann authored
Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Dietmar Eggemann authored
To be able to support multiple energy models before we have the full-fledged dt solution in arm64 (e.g. for the Arm Juno and HiSilicon HiKey platforms) factor out the static energy model data and the appropriate access function into energy_model.h. The patch uses of_match_node() to match the compatible string with the appropriate platform energy model data, i.e. the patch introduces a dependency on CONFIG_OF_FLATTREE for propagating the energy model data towards the task scheduler. Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Juri Lelli authored
This patch is only here to be able to test provisioning of energy related data from an arch topology shim layer to the scheduler. Since there is no code today which deals with extracting energy related data from the dtb or acpi, and processing it in the topology shim layer, the content of the sched_group_energy structures as well as the idle_state and capacity_state arrays are hard-coded here. This patch defines the sched_group_energy structure as well as the idle_state and capacity_state array for the cluster (relates to sched groups (sgs) in DIE sched domain level) and for the core (relates to sgs in MC sd level) for a Cortex A53 as well as for a Cortex A57. It further provides related implementations of the sched_domain_energy_f functions (cpu_cluster_energy() and cpu_core_energy()). To be able to propagate this information from the topology shim layer to the scheduler, the elements of the arm_topology[] table have been provisioned with the appropriate sched_domain_energy_f functions. Signed-off-by:
Juri Lelli <juri.lelli@arm.com>
-
Dietmar Eggemann authored
To make sure that the capacity value of the last element of the capacity states vector of the energy model (EM) core (MC) level is equal to the cpu scale value, use this capacity value to overwrite the cpu scale value previously derived from the Cpu Invariant Engine (CIE). This patch is necessary as long as there is no complete EM support in device tree. Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Dietmar Eggemann authored
This patch is only here to be able to test provisioning of energy related data from an arch topology shim layer to the scheduler. Since there is no code today which deals with extracting energy related data from the dtb or acpi, and processing it in the topology shim layer, the content of the sched_group_energy structures as well as the idle_state and capacity_state arrays are hard-coded here. This patch defines the sched_group_energy structure as well as the idle_state and capacity_state array for the cluster (relates to sched groups (sgs) in DIE sched domain level) and for the core (relates to sgs in MC sd level) for a Cortex A7 as well as for a Cortex A15. It further provides related implementations of the sched_domain_energy_f functions (cpu_cluster_energy() and cpu_core_energy()). To be able to propagate this information from the topology shim layer to the scheduler, the elements of the arm_topology[] table have been provisioned with the appropriate sched_domain_energy_f functions. cc: Russell King <linux@arm.linux.org.uk> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Dietmar Eggemann authored
for arm64 and arm:
  Add CpuFreq governors and make schedutil default
  Add Sched Debug
  Add Ftrace
  Add Function, Function Graph, Irqsoff, Preempt, Sched Tracer
  Add Prove Locking
for arm64:
  Add Generic DT based CpuFreq driver - for hikey
  Add USB Net AX8817X - for hikey
  Add USB Net RTL8152 - for hikey
  Add HI6220 stub clock - for hikey
for arm:
  Add Kernel .config support and /proc/config.gz
  Add DIE sched domain level
  Add Scheduler autogroups
  Add ARM Big.Little cpufreq driver - for TC2
  Add ARM Big.Little cpuidle driver - for TC2
  Add Sensor Vexpress - for TC2
Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Morten Rasmussen authored
Detect and set the SD_SHARE_CAP_STATES sched_domain flag automatically based on the cpufreq policy related_cpus mask. Since the sched_domain flags functions don't take any parameters we have to assume that flags are the same for all sched_domains at the same level, i.e. platforms mixing per-core and per-cluster DVFS are not supported. cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Dietmar Eggemann authored
The default CPU capacity is SCHED_CAPACITY_SCALE (1024). On a heterogeneous system (hmp) this value can be smaller for some cpus. The CPU capacity parsing code normalizes the capacity-dmips-mhz properties w.r.t. the highest value found while parsing the DT to SCHED_CAPACITY_SCALE. CPU capacity can also be changed by writing to /sys/devices/system/cpu/cpu*/cpu_capacity. To make sure that a subset of all online cpus still has a CPU capacity value of SCHED_CAPACITY_SCALE, enforce this in the appropriate sysfs attribute store function cpu_capacity_store(). This will avoid weird setups like transforming an hmp system into an smp system with a CPU capacity < SCHED_CAPACITY_SCALE for all cpus. The current cpu_capacity_store() assumes that all cpus of a cluster have the same CPU capacity value which is true for existing hmp systems (e.g. big.LITTLE). This assumption is also used by this patch. If the new CPU capacity value for a cpu is smaller than SCHED_CAPACITY_SCALE we iterate over the cpus which do not belong to the cpu's cluster and check that there is still a cpu with CPU capacity equal to SCHED_CAPACITY_SCALE. The use of &cpu_topology[this_cpu].core_sibling is replaced by topology_core_cpumask(this_cpu). Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
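A sketch of the check described above, assuming the per-cpu cpu_scale variable used by the arch_topology driver; the helper name and its exact placement inside cpu_capacity_store() are illustrative:

/*
 * Illustrative: when lowering a cpu's capacity below SCHED_CAPACITY_SCALE,
 * make sure at least one cpu outside its cluster keeps full capacity.
 */
static bool capacity_update_allowed(unsigned int this_cpu,
                                    unsigned long new_capacity)
{
        unsigned int cpu;

        if (new_capacity == SCHED_CAPACITY_SCALE)
                return true;

        for_each_online_cpu(cpu) {
                if (cpumask_test_cpu(cpu, topology_core_cpumask(this_cpu)))
                        continue;
                if (per_cpu(cpu_scale, cpu) == SCHED_CAPACITY_SCALE)
                        return true;
        }

        return false;
}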
-
Dietmar Eggemann authored
Output cpu_capacity and raw_capacity in one pr_debug instead of using two. Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Thara Gopinath authored
The current implementation of overutilization aborts energy aware scheduling if any cpu in the system is over-utilized. This patch introduces an overutilization flag per sched domain level instead of a single system-wide flag. Load balancing is done at the sched domain where any of the cpus is over-utilized. If energy aware scheduling is enabled and no cpu in a sched domain is overutilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level. The implementation takes advantage of the shared sched_domain structure that is common across all the sched domains at a level. The new flag introduced is placed in this structure so that all the sched domains at the same level share the flag. In case of an overutilized cpu, the flag gets set at the level-1 sched_domain. The flag at the parent sched_domain level gets set in either of the two following scenarios: 1. There is a misfit task in one of the cpus in this sched_domain. 2. The total utilization of the domain is greater than the domain capacity. The flag is cleared if no cpu in a sched domain is overutilized. This implementation still can have corner scenarios with respect to misfit tasks. For example, consider a sched group with n cpus and n+1 70%-utilized tasks. Ideally this is a case for load balance to happen in a parent sched domain. But neither is the total group utilization high enough for the load balance to be triggered in the parent domain, nor is there a cpu with a single overutilized task so that a load balance is triggered in a parent domain. But again, this could be a purely academic scenario, as during task wake-up these tasks will be placed more appropriately. Signed-off-by:
Thara Gopinath <thara.gopinath@linaro.org>
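A minimal sketch of how such a flag in the shared sched_domain structure might be set and cleared; the field name 'overutilized' and the helper names are assumptions based on the description:

/* Illustrative helpers; the actual flag/field names may differ. */
static inline void set_sd_overutilized(struct sched_domain *sd)
{
        if (sd)
                WRITE_ONCE(sd->shared->overutilized, 1);
}

static inline void clear_sd_overutilized(struct sched_domain *sd)
{
        if (sd)
                WRITE_ONCE(sd->shared->overutilized, 0);
}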
-
Morten Rasmussen authored
With energy-aware scheduling enabled nohz_kick_needed() generates many nohz idle-balance kicks which lead to nothing when multiple tasks get packed on a single cpu to save energy. This causes unnecessary wake-ups and hence wastes energy. Make these conditions depend on !energy_aware() for now until the energy-aware nohz story gets sorted out. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Dietmar Eggemann authored
In case the system operates below the tipping point indicator, introduced in ("sched: Add over-utilization/tipping point indicator"), bail out in find_busiest_group after the dst and src group statistics have been checked. There is simply no need to move usage around because all involved cpus still have spare cycles available. For an energy-aware system below its tipping point, we rely on the task placement of the wakeup path. This works well for short running tasks. The existence of long running tasks on one of the involved cpus lets the system operate over its tipping point. To be able to move such a task (whose load can't be used to average the load among the cpus) from a src cpu with lower capacity than the dst_cpu, an additional rule has to be implemented in need_active_balance. Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Morten Rasmussen authored
When the system is not overutilized, place waking tasks on the most energy efficient cpu. Previous attempts reduced the search space by matching task utilization to cpu capacity before consulting the energy model as this is an expensive operation. The search heuristics didn't work very well and, lacking any better alternatives, this patch takes the brute-force route and tries all potential targets. This approach doesn't scale, but it might be sufficient for many embedded applications while work is continuing on a heuristic that can minimize the necessary computations. The heuristic must be derived from the platform energy model rather than make additional assumptions, such as that lower capacity implies better energy efficiency. PeterZ mentioned in the past that we might be able to derive some simpler deciding functions using mathematical (modal?) analysis. Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Morten Rasmussen authored
Energy-aware scheduling is only meant to be active while the system is _not_ over-utilized. That is, there are spare cycles available to shift tasks around based on their actual utilization to get a more energy-efficient task distribution without depriving any tasks. When above the tipping point, task placement is done the traditional way based on load_avg, spreading the tasks across as many cpus as possible based on priority-scaled load to preserve smp_nice. Below the tipping point we want to use util_avg instead. We need to define a criterion for when we make the switch. The util_avg for each cpu converges towards 100% (1024) regardless of how many additional tasks we may put on it. If we define over-utilized as: sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity) some individual cpus may be over-utilized running multiple tasks even when the above condition is false. That should be okay as long as we try to spread the tasks out to avoid per-cpu over-utilization as much as possible and if all tasks have the _same_ priority. If the latter isn't true, we have to consider priority to preserve smp_nice. For example, we could have n_cpus nice=-10 util_avg=55% tasks and n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks getting their own, as we have 1.5*n_cpus tasks in total and 55%+55% is less over-utilized than 55%+60% for those cpus that have to be shared. The system utilization is only 85% of the system capacity, but we are breaking smp_nice. To be sure not to break smp_nice, we have defined over-utilization conservatively as when any cpu in the system is fully utilized at its highest frequency instead: cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg to factor in priority and preserve smp_nice. With this definition, we can skip periodic load-balance as no cpu has an always-running task when the system is not over-utilized. All tasks will be periodic and we can balance them at wake-up. This conservative condition does however mean that some scenarios that could benefit from energy-aware decisions, even if one cpu is fully utilized, would not get those benefits. For systems where some cpus might have reduced capacity (RT-pressure and/or big.LITTLE), we want periodic load-balance checks as soon as just a single cpu is fully utilized, as it might be one of those with reduced capacity and in that case we want to migrate it. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
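The per-cpu condition above could be expressed roughly as in the following sketch; the helper name and the way the margin is encoded (a capacity_margin/1024 factor, e.g. 1280 for a ~20% margin) are illustrative assumptions:

/*
 * Illustrative: a cpu is considered over-utilized when its utilization
 * plus a margin exceeds its capacity.
 */
static inline bool cpu_overutilized(int cpu)
{
        return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
}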
-
Morten Rasmussen authored
It is not worth the overhead to migrate tasks for tiny insignificant energy savings. To prevent this, an energy margin is introduced in energy_diff() which effectively adds a dead-zone that rounds tiny energy differences to zero. Since no scale is enforced for energy model data the margin can't be absolute. Instead it is defined as +/-1.56% energy saving compared to the current total estimated energy consumption.
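One way to express such a relative dead-zone, as a sketch: a margin of 1/64 of the current total estimated energy is roughly 1.56%; the function and parameter names are illustrative:

/*
 * Illustrative: round energy deltas within +/- 1/64 (~1.56%) of the
 * current total estimated energy down to zero.
 */
static inline int normalize_energy_diff(int energy_diff, int energy_before)
{
        int margin = energy_before >> 6;        /* ~1.56% */

        if (energy_diff > -margin && energy_diff < margin)
                return 0;

        return energy_diff;
}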
-
Dietmar Eggemann authored
To estimate the energy consumption of a sched_group in sched_group_energy() it is necessary to know which idle-state the group is in when it is idle. For now, it is assumed that this is the current idle-state (though it might be wrong). Based on the individual cpu idle-states group_idle_state() finds the group idle-state. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
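A sketch of how the group idle-state could be derived from the individual cpu idle-states, assuming the idle_get_state_idx() accessor introduced by the rq idle-state index patch in this series; taking the shallowest (minimum) state across the group's cpus is one plausible choice:

/*
 * Illustrative sketch: use the shallowest idle-state index among the
 * cpus of the group as the group idle-state.
 */
static int group_idle_state(struct sched_group *sg)
{
        int i, state = INT_MAX;

        for_each_cpu(i, sched_group_span(sg))
                state = min(state, idle_get_state_idx(cpu_rq(i)));

        return state;
}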
-
Morten Rasmussen authored
The idle-state of each cpu is currently pointed to by rq->idle_state but there isn't any information in the struct cpuidle_state that can be used to look up the idle-state energy model data stored in struct sched_group_energy. For this purpose it is necessary to store the idle state index as well. Ideally, the idle-state data should be unified. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
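A sketch of the accessors such an index might get, mirroring the existing idle_set_state()/idle_get_state() pair; the field name idle_state_idx and the helper names are assumptions:

#ifdef CONFIG_CPU_IDLE
/* Illustrative accessors; field and helper names are assumptions. */
static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
{
        rq->idle_state_idx = idle_state_idx;
}

static inline int idle_get_state_idx(struct rq *rq)
{
        return rq->idle_state_idx;
}
#else
static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx) { }
static inline int idle_get_state_idx(struct rq *rq) { return -1; }
#endif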
-
Morten Rasmussen authored
Adds a generic energy-aware helper function, energy_diff(), that calculates energy impact of adding, removing, and migrating utilization in the system. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
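A sketch of the kind of bookkeeping structure energy_diff() might operate on; all member names below are assumptions derived from the description (utilization delta, source/destination cpu, before/after energy):

/* Illustrative only; the real structure in the patch may differ. */
struct energy_env {
        int             util_delta;     /* utilization to add/remove/move */
        int             src_cpu;        /* cpu utilization is removed from */
        int             dst_cpu;        /* cpu utilization is added to */
        unsigned int    energy_before;  /* estimated energy before the change */
        unsigned int    energy_after;   /* estimated energy after the change */
};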
-
Morten Rasmussen authored
Extended sched_group_energy() to support energy prediction with usage (tasks) added/removed from a specific cpu or migrated between a pair of cpus. Useful for load-balancing decision making. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Morten Rasmussen authored
For energy-aware load-balancing decisions it is necessary to know the energy consumption estimates of groups of cpus. This patch introduces a basic function, sched_group_energy(), which estimates the energy consumption of the cpus in the group and any resources shared by the members of the group. NOTE: The function has five levels of indentation and breaks the 80 character limit. Refactoring is necessary. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
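Schematically, the estimate for one group can be thought of as its busy power weighted by normalized utilization plus its idle power weighted by the remaining fraction. A hedged sketch (names and scaling are illustrative, not the patch's code):

/*
 * Illustrative: busy/idle power of the group's current capacity and idle
 * state, weighted by the group's normalized utilization.
 */
static unsigned long group_energy_estimate(unsigned long group_util,
                                           unsigned long scale_cpu,
                                           unsigned long busy_power,
                                           unsigned long idle_power)
{
        unsigned long busy_energy = (group_util * busy_power) / scale_cpu;
        unsigned long idle_energy =
                ((scale_cpu - group_util) * idle_power) / scale_cpu;

        return busy_energy + idle_energy;
}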
-
Morten Rasmussen authored
Add another member to the family of per-cpu sched_domain shortcut pointers. This one, sd_ea, points to the highest level at which an energy model is provided. At this level and all levels below all sched_groups have energy model data attached. Partial energy model information is possible but restricted to providing energy model data for lower level sched_domains (sd_ea and below) and leaving load-balancing on levels above to non-energy-aware load-balancing. For example, it is possible to apply energy-aware scheduling within each socket on a multi-socket system and let normal scheduling handle load-balancing between sockets. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
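A sketch of how such a shortcut pointer is typically maintained, following the pattern of the existing sd_llc pointers; the sge member test and the helper name are assumptions based on this series:

DECLARE_PER_CPU(struct sched_domain *, sd_ea);

/* Illustrative: walk up the hierarchy while energy model data is present. */
static void update_sd_ea(int cpu)
{
        struct sched_domain *sd, *ea_sd = NULL;

        for_each_domain(cpu, sd) {
                if (!sd->groups->sge)
                        break;
                ea_sd = sd;
        }

        rcu_assign_pointer(per_cpu(sd_ea, cpu), ea_sd);
}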
-
Morten Rasmussen authored
Move cpu_util() to an earlier position in fair.c and change return type to unsigned long as negative usage doesn't make much sense. All other load and capacity related functions use unsigned long including the caller of cpu_util(). cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Dietmar Eggemann authored
For Energy-Aware Scheduling (EAS) to work properly, even in the case that there is only one cpu per cluster or that cpus are hot-plugged out, the Energy Model (EM) data on all energy-aware sched domains (sd) has to be present for all online cpus.

Mainline sd hierarchy setup code will remove sd's which are not useful for task scheduling e.g. in the following situations:

1. Only 1 cpu is/remains in one cluster of a multi cluster system. This remaining cpu only has DIE and no MC sd.

2. A complete cluster in a two cluster system is hot-plugged out. The cpus of the remaining cluster only have MC and no DIE sd.

To make sure that all online cpus keep all their energy-aware sd's, the sd degenerate functionality has been changed to not free a sd if its first sched group (sg) contains EM data, in case:

1. There is only 1 cpu left in the sd.

2. There have to be at least 2 sg's if certain sd flags are set.

Instead of freeing such a sd it now clears only its SD_LOAD_BALANCE flag. This will make sure that the EAS functionality will always see all energy-aware sd's for all online cpus. It will introduce a tiny performance degradation for operations on affected cpus since the hot-path macro for_each_domain() has to deal with sd's not contributing to task scheduling at all now. In most cases the existing code makes sure that task scheduling is not invoked on a sd with !SD_LOAD_BALANCE. However, a small change is necessary in update_sd_lb_stats() to make sure that sd->parent is only initialized to !NULL in case the parent sd contains more than 1 sg. The handling of newidle decay values before the SD_LOAD_BALANCE check in rebalance_domains() stays unchanged.

Test (w/ CONFIG_SCHED_DEBUG):

JUNO r0 default system:

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd07
CPU part : 0xd07
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03

SD names and flags:

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f

Test 1: Hotplug-out one A57 (CPU part 0xd07) cpu:

$ echo 0 > /sys/devices/system/cpu/cpu1/online

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd07
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03

SD names and flags for remaining A57 (cpu2) cpu:

$ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags`
832e <-- MC SD with !SD_LOAD_BALANCE
102f

Test 2: Hotplug-out the entire A57 cluster:

$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 0 > /sys/devices/system/cpu/cpu2/online

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03
CPU part : 0xd03

SD names and flags for the remaining A53 (CPU part 0xd03) cluster:

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102e <-- DIE SD with !SD_LOAD_BALANCE
832f
102e
832f
102e
832f
102e

Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Morten Rasmussen authored
cpufreq is currently keeping it a secret which cpus are sharing a clock source. The scheduler needs to know about clock domains as well to become more energy aware. The SD_SHARE_CAP_STATES domain flag indicates whether cpus belonging to the sched_domain share capacity states (P-states). There is no connection with cpufreq (yet). The flag must be set by the arch specific topology code. cc: Russell King <linux@arm.linux.org.uk> cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Dietmar Eggemann authored
The sched_group_energy (sge) pointer of the first sched_group (sg) in the sched_domain (sd) is initialized to point to the appropriate (in terms of sd level and cpu) sge data defined in the arch and so to the correct part of the Energy Model (EM). Energy-aware scheduling allows that a system has only EM data up to a certain sd level (so called highest energy aware balancing sd level). A check in init_sched_energy() enforces that all sd's below this sd level contain EM data. The 'int cpu' parameter of sched_domain_energy_f requires that check_sched_energy_data() makes sure that all cpus spanned by a sg are provisioned with the same EM data. This patch has also been tested with feature FORCE_SD_OVERLAP enabled. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Dietmar Eggemann authored
The struct sched_group_energy represents the per sched_group related data which is needed for energy aware scheduling. It contains: (1) number of elements of the idle state array (2) pointer to the idle state array which comprises 'power consumption' for each idle state (3) number of elements of the capacity state array (4) pointer to the capacity state array which comprises 'compute capacity and power consumption' tuples for each capacity state The struct sched_group obtains a pointer to a struct sched_group_energy. The function pointer sched_domain_energy_f is introduced into struct sched_domain_topology_level which will allow the arch to pass a particular struct sched_group_energy from the topology shim layer into the scheduler core. The function pointer sched_domain_energy_f has an 'int cpu' parameter since the folding of two adjacent sd levels via sd degenerate doesn't work for all sd levels. I.e. it is not possible for example to use this feature to provide per-cpu energy in sd level DIE on ARM's TC2 platform. It was discussed that the folding of sd levels approach is preferable over the cpu parameter approach, simply because the user (the arch specifying the sd topology table) can introduce less errors. But since it is not working, the 'int cpu' parameter is the only way out. It's possible to use the folding of sd levels approach for sched_domain_flags_f and the cpu parameter approach for the sched_domain_energy_f at the same time though. With the use of the 'int cpu' parameter, an extra check function has to be provided to make sure that all cpus spanned by a sched group are provisioned with the same energy data. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
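The description above maps naturally onto structures like the following sketch; the member names are assumptions (the patch may use different ones):

struct idle_state {
        unsigned long power;               /* power consumption in this idle state */
};

struct capacity_state {
        unsigned long cap;                 /* compute capacity */
        unsigned long power;               /* power consumption at this capacity */
};

struct sched_group_energy {
        unsigned int nr_idle_states;       /* (1) number of idle states */
        struct idle_state *idle_states;    /* (2) idle state array */
        unsigned int nr_cap_states;        /* (3) number of capacity states */
        struct capacity_state *cap_states; /* (4) capacity state array */
};

typedef const struct sched_group_energy * (*sched_domain_energy_f)(int cpu);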
-
Morten Rasmussen authored
This patch introduces the ENERGY_AWARE sched feature, which is implemented using jump labels when SCHED_DEBUG is defined. It is statically set to false when SCHED_DEBUG is not defined. Hence this doesn't allow energy awareness to be enabled without SCHED_DEBUG. This sched_feature knob will be replaced later with a more appropriate control knob when things have matured a bit. ENERGY_AWARE is based on per-entity load-tracking, hence FAIR_GROUP_SCHED must be enabled. This dependency isn't checked at compile time yet. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
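A sketch of the knob and a query helper as described; the energy_aware() helper name matches its use elsewhere in this series, but whether the patch defines it exactly this way is an assumption:

/* In features.h: default-disabled energy awareness knob. */
SCHED_FEAT(ENERGY_AWARE, false)

/* Illustrative query helper used by energy-aware code paths. */
static inline bool energy_aware(void)
{
        return sched_feat(ENERGY_AWARE);
}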
-
Morten Rasmussen authored
This documentation patch provides an overview of the experimental scheduler energy costing model, associated data structures, and a reference recipe on how platforms can be characterized to derive energy models. Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Brendan Jackman authored
We now have a NOHZ kick to avoid the load of idle CPUs becoming stale. This is good, but it brings about CPU wakeups, which have an energy cost. As an alternative to waking CPUs up to decay blocked load, we can sometimes do it from the newly-idle balance. If the newly-idle balance is on a domain that covers all the currently nohz-idle CPUs, we push the value of nohz.next_update into the future. That means that if such newly-idle balances happen often enough, we never need to wake up a CPU just to update load. Since we're doing this new update inside a for_each_domain, we need to do something to avoid doing multiple updates on the same CPU in the same idle_balance. A tick stamp is set on the rq in update_blocked_averages as a simple way to do this. Using a simple jiffies-based timestamp, as opposed to the last_update_time of the root cfs_rq's sched_avg, means we can do this without taking the rq lock. Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Brendan Jackman <brendan.jackman@arm.com>
-
Vincent Guittot authored
When idle, the blocked load of CPUs will be updated only when an idle load balance is triggered, which may never happen. Because of this uncertainty about the execution of idle load balance, the utilization, the load and the shares of idle cfs_rqs can stay artificially high and steal shares and running time from busy cfs_rqs of the task group. Add a new light idle load balance state which ensures that blocked loads are periodically updated and decayed but does not perform any task migration. The remote load updates are rate-limited, so that they are not performed with a shorter period than LOAD_AVG_PERIOD (i.e. PELT half-life). This is the period after which we have a known 50% error in stale load. Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Vincent Guittot <vincent.guittot@linaro.org> [Switched remote update interval to use PELT half life] [Moved update_blocked_averges call outside rebalance_domains to simplify code] Signed-off-by:
Brendan Jackman <brendan.jackman@arm.com>
-
Morten Rasmussen authored
The patch lets the arch_topology driver take over setting of sched_domain flags that should be detected dynamically based on the actual system topology. cc: Catalin Marinas <catalin.marinas@arm.com> cc: Will Deacon <will.deacon@arm.com> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-