  Nov 23, 2017
    • sched/events: Introduce task_group load tracking trace event · 6d869c5b
      Dietmar Eggemann authored
      
      
      The trace event key load is mapped to:
      
       (1) load : cfs_rq->tg->load_avg
      
      The cfs_rq owned by the task_group is used as the only parameter for the
      trace event because it has a reference to both the task_group and the
      cpu. Using the task_group as a parameter instead would require the cpu
      as a second parameter, since a task_group is global rather than per-cpu
      data. The cpu key only indicates on which cpu the value was gathered.
      
      The following list shows examples of the key=value pairs for:
      
       (1) a task group:
      
           cpu=1 path=/tg1/tg11/tg111 load=517
      
       (2) an autogroup:
      
           cpu=1 path=/autogroup-10 load=1050
      
      We don't maintain a load signal for a root task group.
      
      The trace event is only defined if cfs group scheduling support
      (CONFIG_FAIR_GROUP_SCHED) is enabled.
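      
      A minimal sketch of how such a trace event definition could look,
      assuming the event is named sched_load_tg and reusing the
      __trace_sched_path()/__trace_sched_cpu() helpers described in this
      patch-stack (exact signatures are assumptions, not the actual hunk):
      
        TRACE_EVENT(sched_load_tg,
      
                TP_PROTO(struct cfs_rq *cfs_rq),
      
                TP_ARGS(cfs_rq),
      
                TP_STRUCT__entry(
                        __field(        int,    cpu     )
                        __dynamic_array(char,   path,
                                __trace_sched_path(cfs_rq, NULL, 0))
                        __field(        long,   load    )
                ),
      
                TP_fast_assign(
                        __entry->cpu = __trace_sched_cpu(cfs_rq);
                        __trace_sched_path(cfs_rq, __get_dynamic_array(path),
                                           __get_dynamic_array_len(path));
                        __entry->load = atomic_long_read(&cfs_rq->tg->load_avg);
                ),
      
                TP_printk("cpu=%d path=%s load=%ld",
                          __entry->cpu, __get_str(path), __entry->load)
        );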
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
    • sched/events: Introduce sched_entity load tracking trace event · 6d86076a
      Dietmar Eggemann authored
      
      
      The following trace event keys are mapped to:
      
       (1) load     : se->avg.load_avg
      
       (2) rbl_load : se->avg.runnable_load_avg
      
       (3) util     : se->avg.util_avg
      
      To let this trace event work for configurations w/ and w/o group
      scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following
      special handling is necessary for non-existent key=value pairs:
      
       path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED or the
                         sched_entity represents a task.
      
       comm = "(null)" : In case sched_entity represents a task_group.
      
       pid = -1        : In case sched_entity represents a task_group.
      
      The following list shows examples of the key=value pairs in different
      configurations for:
      
       (1) a task:
      
           cpu=0 path=(null) comm=sshd pid=2206 load=102 rbl_load=102 util=102
      
       (2) a taskgroup:
      
           cpu=1 path=/tg1/tg11/tg111 comm=(null) pid=-1 load=882 rbl_load=882 util=510
      
       (3) an autogroup:
      
           cpu=0 path=/autogroup-13 comm=(null) pid=-1 load=49 rbl_load=49 util=48
      
       (4) w/o CONFIG_FAIR_GROUP_SCHED:
      
           cpu=0 path=(null) comm=sshd pid=2211 load=301 rbl_load=301 util=265
      
      The trace event is only defined for CONFIG_SMP.
      
      The helper functions __trace_sched_cpu(), __trace_sched_path() and
      __trace_sched_id() are extended to deal with sched_entities as well.
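      
      Assuming the event lands under the sched subsystem in tracefs (the
      event name sched_load_se below is an assumption), it can be captured
      like any other trace event:
      
        $ echo 1 > /sys/kernel/debug/tracing/events/sched/sched_load_se/enable
        $ cat /sys/kernel/debug/tracing/trace_pipe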
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
    • sched/events: Introduce cfs_rq load tracking trace event · c4322f76
      Dietmar Eggemann authored
      
      
      The following trace event keys are mapped to:
      
       (1) load     : cfs_rq->avg.load_avg
      
       (2) rbl_load : cfs_rq->avg.runnable_load_avg
      
       (3) util     : cfs_rq->avg.util_avg
      
      To let this trace event work for configurations w/ and w/o group
      scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following
      special handling is necessary for a non-existent key=value pair:
      
       path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED.
      
      The following list shows examples of the key=value pairs in different
      configurations for:
      
       (1) a root task_group:
      
           cpu=4 path=/ load=6 rbl_load=6 util=331
      
       (2) a task_group:
      
           cpu=1 path=/tg1/tg11/tg111 load=538 rbl_load=538 util=522
      
       (3) an autogroup:
      
           cpu=3 path=/autogroup-18 load=997 rbl_load=997 util=517
      
       (4) w/o CONFIG_FAIR_GROUP_SCHED:
      
           cpu=0 path=(null) load=314 rbl_load=314 util=289
      
      The trace event is only defined for CONFIG_SMP.
      
      The helper function __trace_sched_path() can be used to get the length
      parameter of the dynamic array (path == NULL) and to copy the path into
      it (path != NULL).
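      
      A sketch of the resulting two-pass call pattern inside the event
      definition (names as described above; the exact prototype is an
      assumption):
      
        /* 1st pass: path == NULL, only the required buffer length is
         * returned, used to size the dynamic array. */
        __dynamic_array(char, path, __trace_sched_path(cfs_rq, NULL, 0))
      
        /* 2nd pass: copy the path into the dynamic array. */
        __trace_sched_path(cfs_rq, __get_dynamic_array(path),
                           __get_dynamic_array_len(path));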
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
    • sched/autogroup: Define autogroup_path() for !CONFIG_SCHED_DEBUG · 08ae6215
      Dietmar Eggemann authored
      
      
      Define autogroup_path() even in the !CONFIG_SCHED_DEBUG case. If
      CONFIG_SCHED_AUTOGROUP is enabled the path of an autogroup has to be
      available to be printed in the load tracking trace events provided by
      this patch-stack regardless of whether CONFIG_SCHED_DEBUG is set or not.
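      
      For reference, a sketch of the helper along the lines of its
      CONFIG_SCHED_DEBUG variant (illustrative, not the exact patch hunk):
      
        int autogroup_path(struct task_group *tg, char *buf, int buflen)
        {
                if (!task_group_is_autogroup(tg))
                        return 0;
      
                return snprintf(buf, buflen, "%s-%ld", "/autogroup",
                                tg->autogroup->id);
        }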
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Add energy procfs interface · 4e6424c7
      Dietmar Eggemann authored
      
      
      This patch makes the energy data available via procfs. The related files
      are placed in a sub-directory named 'energy' inside the
      /proc/sys/kernel/sched_domain/cpuX/domainY/groupZ directory for those
      cpu/domain/group tuples which have energy information.
      
      The following example depicts the contents of
      /proc/sys/kernel/sched_domain/cpu0/domain0/group[01] for a system which
      has energy information attached to domain level 0.
      
      ├── cpu0
      │   ├── domain0
      │   │   ├── busy_factor
      │   │   ├── busy_idx
      │   │   ├── cache_nice_tries
      │   │   ├── flags
      │   │   ├── forkexec_idx
      │   │   ├── group0
      │   │   │   └── energy
      │   │   │       ├── cap_states
      │   │   │       ├── idle_states
      │   │   │       ├── nr_cap_states
      │   │   │       └── nr_idle_states
      │   │   ├── group1
      │   │   │   └── energy
      │   │   │       ├── cap_states
      │   │   │       ├── idle_states
      │   │   │       ├── nr_cap_states
      │   │   │       └── nr_idle_states
      │   │   ├── idle_idx
      │   │   ├── imbalance_pct
      │   │   ├── max_interval
      │   │   ├── max_newidle_lb_cost
      │   │   ├── min_interval
      │   │   ├── name
      │   │   ├── newidle_idx
      │   │   └── wake_idx
      │   └── domain1
      │       ├── busy_factor
      │       ├── busy_idx
      │       ├── cache_nice_tries
      │       ├── flags
      │       ├── forkexec_idx
      │       ├── idle_idx
      │       ├── imbalance_pct
      │       ├── max_interval
      │       ├── max_newidle_lb_cost
      │       ├── min_interval
      │       ├── name
      │       ├── newidle_idx
      │       └── wake_idx
      
      The files 'nr_idle_states' and 'nr_cap_states' each contain a scalar
      value, whereas 'idle_states' contains a vector of power consumption
      values (one per idle state) and 'cap_states' contains a vector of
      (compute capacity, power consumption) tuples (one per capacity state).
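      
      For example, the capacity state data of the first group can be read
      directly (paths as in the tree above):
      
        $ cd /proc/sys/kernel/sched_domain/cpu0/domain0/group0/energy
        $ cat nr_cap_states
        $ cat cap_states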
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64: use cpu scale value derived from energy model · cc65f03d
      Dietmar Eggemann authored
      
      
      To make sure that the capacity value of the last element of the capacity
      states vector of the energy model (EM) core (MC) level is equal to the
      cpu scale value, use this capacity value to overwrite the cpu scale
      value previously derived from the Cpu Invariant Engine (CIE).
      
      This patch is necessary as long as there is no complete EM support in
      device tree.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64: define hikey620 sys sd energy model · cd4a2377
      Dietmar Eggemann authored
      
      
      Hi6220 has a single frequency domain spanning the two clusters. It
      needs the SYS sched domain (sd) to let the EAS algorithm work
      properly.
      
      The SD_SHARE_CAP_STATES flag is not set on SYS sd.
      
      This lets sd_ea (highest sd w/ energy model data) point to the SYS
      sd whereas sd_scs (highest sd w/ SD_SHARE_CAP_STATES set) points to
      the DIE sd. This setup allows the code in sched_group_energy() to
      set sg_shared_cap to the single sched group of the SYS sd covering
      all the cpus in the system as they are all part of the single
      frequency domain.
      
      The capacity and idle state vectors only contain entries w/ power
      values equal to zero, so there is no system-wide energy contribution.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64: introduce sys sd energy model infrastructure · 49dc3bb7
      Dietmar Eggemann authored
      Allow the energy model to contain a system level besides the already
      existing core and cluster level.
      
      This is necessary for platforms with frequency domains spanning all
      cpus to let the EAS algorithm work properly.
      
      The whole idea of this system level has to be rethought once
      the idea of the 'struct sched_domain_shared' gets more momentum:
      
      https://lkml.org/lkml/2016/6/16/209
      
      
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64, dts: add hikey cpu capacity-dmips-mhz information · 00ee664a
      Dietmar Eggemann authored
      
      
      Hikey is an SMP platform, so this property would normally not be necessary.
      
      But since we drive the setting of the EAS specific sched domain flag
      SD_SHARE_CAP_STATES via the init_cpu_capacity_callback() cpufreq
      notifier, we have to make sure that cap_parsing_failed is not set to
      true in parse_cpu_capacity() so that init_cpu_capacity_callback() does
      not bail out before consuming the CPUFREQ_NOTIFY event. The easiest way
      to achieve this is to provide the dts file with this property.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • cd52d507
    • arm64: factor out energy model from topology shim layer · 60d66416
      Dietmar Eggemann authored
      
      
      To be able to support multiple energy models before we have the
      full-fledged dt solution in arm64 (e.g. for the Arm Juno and
      Hisilicon Hikey platforms), factor out the static energy model data and the
      appropriate access function into energy_model.h.
      
      The patch uses of_match_node() to match the compatible string with the
      appropriate platform energy model data, i.e. the patch introduces a
      dependency to CONFIG_OF_FLATTREE for propagating the energy model data
      towards the task scheduler.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64, topology: Define JUNO energy and provide it to the scheduler · 1e983941
      Juri Lelli authored
      
      
      This patch is only here to be able to test provisioning of energy related
      data from an arch topology shim layer to the scheduler. Since there is no
      code today which deals with extracting energy related data from the dtb or
      acpi, and processing it in the topology shim layer, the content of the
      sched_group_energy structures as well as the idle_state and capacity_state
      arrays are hard-coded here.
      
      This patch defines the sched_group_energy structure as well as the
      idle_state and capacity_state array for the cluster (relates to sched
      groups (sgs) in DIE sched domain level) and for the core (relates to sgs
      in MC sd level) for a Cortex A53 as well as for a Cortex A57.
      It further provides related implementations of the sched_domain_energy_f
      functions (cpu_cluster_energy() and cpu_core_energy()).
      
      To be able to propagate this information from the topology shim layer to
      the scheduler, the elements of the arm_topology[] table have been
      provisioned with the appropriate sched_domain_energy_f functions.
      
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
    • arm: use cpu scale value derived from energy model · 2ba06bfb
      Dietmar Eggemann authored
      
      
      To make sure that the capacity value of the last element of the capacity
      states vector of the energy model (EM) core (MC) level is equal to the
      cpu scale value, use this capacity value to overwrite the cpu scale
      value previously derived from the Cpu Invariant Engine (CIE).
      
      This patch is necessary as long as there is no complete EM support in
      device tree.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm: topology: Define TC2 energy and provide it to the scheduler · 7a35cdc0
      Dietmar Eggemann authored
      
      
      This patch is only here to be able to test provisioning of energy related
      data from an arch topology shim layer to the scheduler. Since there is no
      code today which deals with extracting energy related data from the dtb or
      acpi, and processing it in the topology shim layer, the content of the
      sched_group_energy structures as well as the idle_state and capacity_state
      arrays are hard-coded here.
      
      This patch defines the sched_group_energy structure as well as the
      idle_state and capacity_state array for the cluster (relates to sched
      groups (sgs) in DIE sched domain level) and for the core (relates to sgs
      in MC sd level) for a Cortex A7 as well as for a Cortex A15.
      It further provides related implementations of the sched_domain_energy_f
      functions (cpu_cluster_energy() and cpu_core_energy()).
      
      To be able to propagate this information from the topology shim layer to
      the scheduler, the elements of the arm_topology[] table have been
      provisioned with the appropriate sched_domain_energy_f functions.
      
      cc: Russell King <linux@arm.linux.org.uk>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64, arm: Tweak defconfig and multi_v7_defconfig for EAS integration · aedb6d5c
      Dietmar Eggemann authored
      
      
      for arm64 and arm:
      
      Add    CpuFreq governors and make schedutil default
      Add    Sched Debug
      Add    Ftrace
      Add    Function, Function Graph, Irqsoff, Preempt, Sched Tracer
      Add    Prove Locking
      
      for arm64:
      
      Add    Generic DT based CpuFreq driver - for hikey
      Add    USB Net AX8817X                 - for hikey
      Add    USB Net RTL8152                 - for hikey
      Add    HI6220 stub clock               - for hikey
      
      for arm:
      
      Add    Kernel .config support and /proc/config.gz
      Add    DIE sched domain level
      Add    Scheduler autogroups
      Add    ARM Big.Little cpufreq driver - for TC2
      Add    ARM Big.Little cpuidle driver - for TC2
      Add    Sensor Vexpress               - for TC2
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • drivers base/arch_topology: Detect SD_SHARE_CAP_STATES flag · 3ffc4719
      Morten Rasmussen authored
      
      
      Detect and set the SD_SHARE_CAP_STATES sched_domain flag automatically
      based on the cpufreq policy related_cpus mask. Since the sched_domain
      flags functions don't take any parameters we have to assume that the
      flags are the same for all sched_domains at the same level, i.e.
      platforms mixing per-core and per-cluster DVFS are not supported.
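      
      A sketch of how such a topology flags function could look, assuming a
      helper that exposes the cpumask derived from the cpufreq policy's
      related_cpus (freq_domain_mask() is a hypothetical name):
      
        static int arch_core_flags(int cpu)
        {
                int flags = 0;
      
                /* All cpus at this topology level share P-states iff they
                 * all sit in the same frequency domain. */
                if (cpumask_subset(topology_core_cpumask(cpu),
                                   freq_domain_mask(cpu)))
                        flags |= SD_SHARE_CAP_STATES;
      
                return flags;
        }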
      
      cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • drivers base/arch_topology: enforce SCHED_CAPACITY_SCALE as highest CPU capacity · 978bb184
      Dietmar Eggemann authored
      
      
      The default CPU capacity is SCHED_CAPACITY_SCALE (1024).
      
      On a heterogeneous system (hmp) this value can be smaller for some cpus.
      The CPU capacity parsing code normalizes the capacity-dmips-mhz
      properties w.r.t. the highest value found while parsing the DT to
      SCHED_CAPACITY_SCALE.
      
      CPU capacity can also be changed by writing to
      /sys/devices/system/cpu/cpu*/cpu_capacity.
      
      To make sure that a subset of all online cpus still has a CPU capacity
      value of SCHED_CAPACITY_SCALE, enforce this in the appropriate sysfs
      attribute store function, cpu_capacity_store().
      
      This avoids weird setups like transforming an hmp system into an smp
      system with a CPU capacity < SCHED_CAPACITY_SCALE for all cpus.
      
      The current cpu_capacity_store() assumes that all cpus of a cluster have
      the same CPU capacity value which is true for existing hmp systems (e.g.
      big.LITTLE). This assumption is also used by this patch.
      If the new CPU capacity value for a cpu is smaller than
      SCHED_CAPACITY_SCALE we iterate over the cpus which do not belong to the
      cpu's cluster and check that there is still a cpu with a CPU capacity
      equal to SCHED_CAPACITY_SCALE.
      
      The use of &cpu_topology[this_cpu].core_sibling is replaced by
      topology_core_cpumask(this_cpu).
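      
      A sketch of the added check (variable and helper names assumed):
      
        /* Lowering this cluster's capacity below SCHED_CAPACITY_SCALE is
         * only allowed if a cpu outside the cluster keeps the maximum. */
        if (new_capacity < SCHED_CAPACITY_SCALE) {
                bool max_capacity_found = false;
                int i;
      
                for_each_cpu_not(i, topology_core_cpumask(this_cpu)) {
                        if (per_cpu(cpu_scale, i) == SCHED_CAPACITY_SCALE) {
                                max_capacity_found = true;
                                break;
                        }
                }
      
                if (!max_capacity_found)
                        return -EINVAL;
        }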
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • drivers base/arch_topology: fold two pr_debug()'s into one · 978079a9
      Dietmar Eggemann authored
      
      
      Output cpu_capacity and raw_capacity in one pr_debug() call instead
      of two.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • Per Sched domain over utilization · 2e28b34e
      Thara Gopinath authored
      
      
      The current implementation of overutilization aborts energy aware
      scheduling if any cpu in the system is over-utilized. This patch
      introduces an over-utilization flag per sched domain level instead of a
      single system-wide flag. Load balancing is done at any sched domain
      where one of the cpus is over-utilized. If energy aware scheduling is
      enabled and no cpu in a sched domain is over-utilized, load balancing
      is skipped for that sched domain and energy aware scheduling continues
      at that level.
      
      The implementation takes advantage of the shared sched_domain structure
      that is common across all the sched domains at a level. The new flag is
      placed in this structure so that all the sched domains at the same
      level share the flag. In case of an over-utilized cpu, the flag gets
      set at the level-1 sched_domain. The flag at the parent sched_domain
      level gets set in either of the two following scenarios:
       1. There is a misfit task in one of the cpus in this sched_domain.
       2. The total utilization of the domain is greater than the domain
          capacity.
      
      The flag is cleared if no cpu in a sched domain is over-utilized.
      
      This implementation can still have corner scenarios with respect to
      misfit tasks. For example, consider a sched group with n cpus and
      n+1 70%-utilized tasks. Ideally this is a case for a load balance to
      happen in a parent sched domain. But neither is the total group
      utilization high enough for the load balance to be triggered in the
      parent domain, nor is there a cpu with a single over-utilized task so
      that a load balance is triggered in a parent domain. Then again, this
      could be a purely academic scenario, as during task wake-up these tasks
      will be placed more appropriately.
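      
      A sketch of the flag placement in the shared structure and a helper to
      query it (field and helper names assumed):
      
        struct sched_domain_shared {
                atomic_t        ref;
                atomic_t        nr_busy_cpus;
                int             has_idle_cores;
                int             overutilized;   /* new: shared per sd level */
        };
      
        static inline bool sd_overutilized(struct sched_domain *sd)
        {
                return READ_ONCE(sd->shared->overutilized);
        }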
      
      Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
    • sched: Disable energy-unfriendly nohz kicks · e855790b
      Morten Rasmussen authored
      
      
      With energy-aware scheduling enabled nohz_kick_needed() generates many
      nohz idle-balance kicks which lead to nothing when multiple tasks get
      packed on a single cpu to save energy. This causes unnecessary wake-ups
      and hence wastes energy. Make these conditions depend on !energy_aware()
      for now until the energy-aware nohz story gets sorted out.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Consider a not over-utilized energy-aware system as balanced · cbd0e38b
      Dietmar Eggemann authored
      
      
      In case the system operates below the tipping point indicator,
      introduced in ("sched: Add over-utilization/tipping point
      indicator"), bail out in find_busiest_group after the dst and src
      group statistics have been checked.
      
      There is simply no need to move usage around because all involved
      cpus still have spare cycles available.
      
      For an energy-aware system below its tipping point, we rely on the
      task placement of the wakeup path. This works well for short running
      tasks.
      
      The existence of long running tasks on one of the involved cpus lets
      the system operate over its tipping point. To be able to move such
      a task (whose load can't be used to average the load among the cpus)
      from a src cpu with lower capacity than the dst_cpu, an additional
      rule has to be implemented in need_active_balance.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched/fair: Energy-aware wake-up task placement · 21760286
      Morten Rasmussen authored
      
      
      When the system is not over-utilized, place waking tasks on the most
      energy efficient cpu. Previous attempts reduced the search space by
      matching task utilization to cpu capacity before consulting the energy
      model, as this is an expensive operation. The search heuristics didn't
      work very well and, lacking any better alternatives, this patch takes
      the brute-force route and tries all potential targets.
      
      This approach doesn't scale, but it might be sufficient for many
      embedded applications while work is continuing on a heuristic that can
      minimize the necessary computations. The heuristic must be derived from
      the platform energy model rather than making additional assumptions,
      such as that lower capacity implies better energy efficiency. PeterZ
      mentioned in the past that we might be able to derive some simpler
      deciding functions using mathematical (modal?) analysis.
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Add over-utilization/tipping point indicator · 31453599
      Morten Rasmussen authored
      
      
      Energy-aware scheduling is only meant to be active while the system is
      _not_ over-utilized. That is, there are spare cycles available to shift
      tasks around based on their actual utilization to get a more
      energy-efficient task distribution without depriving any tasks. When
      above the tipping point task placement is done the traditional way based
      on load_avg, spreading the tasks across as many cpus as possible based
      on priority scaled load to preserve smp_nice. Below the tipping point we
      want to use util_avg instead. We need to define a criteria for when we
      make the switch.
      
      The util_avg for each cpu converges towards 100% (1024) regardless of
      how many additional tasks we may put on it. If we define
      over-utilized as:
      
      sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)
      
      some individual cpus may be over-utilized running multiple tasks even
      when the above condition is false. That should be okay as long as we try
      to spread the tasks out to avoid per-cpu over-utilization as much as
      possible and if all tasks have the _same_ priority. If the latter isn't
      true, we have to consider priority to preserve smp_nice.
      
      For example, we could have n_cpus nice=-10 util_avg=55% tasks and
      n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
      likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
      getting their own, as we have 1.5*n_cpus tasks in total and 55%+55% is less
      over-utilized than 55%+60% for those cpus that have to be shared. The
      system utilization is only 85% of the system capacity, but we are
      breaking smp_nice.
      
      To be sure not to break smp_nice, we have defined over-utilization
      conservatively as when any cpu in the system is fully utilized at its
      highest frequency instead:
      
      cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity
      
      IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
      to factor in priority to preserve smp_nice.
      
      With this definition, we can skip periodic load-balance as no cpu has an
      always-running task when the system is not over-utilized. All tasks will
      be periodic and we can balance them at wake-up. This conservative
      condition does however mean that some scenarios that could benefit from
      energy-aware decisions even if one cpu is fully utilized would not get
      those benefits.
      
      For systems where some cpus might have reduced capacity (RT-pressure
      and/or big.LITTLE), we want periodic load-balance checks as soon as just
      a single cpu is fully utilized, as it might be one of those with reduced
      capacity, and in that case we want to migrate it.
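      
      A sketch of the per-cpu condition (capacity_margin as an assumed
      tuning value, e.g. 1280 for a ~20% margin on a 1024 scale):
      
        static bool cpu_overutilized(int cpu)
        {
                /* util * margin > capacity * 1024, i.e. utilization plus
                 * a margin exceeds the cpu's capacity. */
                return (capacity_of(cpu) * 1024) <
                       (cpu_util(cpu) * capacity_margin);
        }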
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched/fair: Add energy_diff dead-zone margin · 75d2c160
      Morten Rasmussen authored
      It is not worth the overhead to migrate tasks for tiny insignificant
      energy savings. To prevent this, an energy margin is introduced in
      energy_diff() which effectively adds a dead-zone that rounds tiny energy
      differences to zero. Since no scale is enforced for energy model data
      the margin can't be absolute. Instead it is defined as +/-1.56% energy
      saving compared to the current total estimated energy consumption.
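      
      +/-1.56% is 1/64 of the current total estimated energy, which keeps the
      margin computation to a shift; a sketch (names assumed):
      
        /* Round energy differences within +/- total/64 (~1.56%) to zero. */
        static inline int normalize_energy(int energy_diff, int total_energy)
        {
                int margin = total_energy >> 6;         /* 1/64 ~= 1.56% */
      
                if (energy_diff > -margin && energy_diff < margin)
                        return 0;
      
                return energy_diff;
        }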
    • sched: Determine the current sched_group idle-state · a14c0a78
      Dietmar Eggemann authored
      
      
      To estimate the energy consumption of a sched_group in
      sched_group_energy() it is necessary to know which idle-state the group
      is in when it is idle. For now, it is assumed that this is the current
      idle-state (though it might be wrong). Based on the individual cpu
      idle-states group_idle_state() finds the group idle-state.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched, cpuidle: Track cpuidle state index in the scheduler · 729ee534
      Morten Rasmussen authored
      
      
      The idle-state of each cpu is currently pointed to by rq->idle_state but
      there isn't any information in the struct cpuidle_state that can be used
      to look up the idle-state energy model data stored in struct
      sched_group_energy. For this purpose it is necessary to store the idle
      state index as well. Ideally, the idle-state data should be unified.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Estimate energy impact of scheduling decisions · a8e979ab
      Morten Rasmussen authored
      
      
      Adds a generic energy-aware helper function, energy_diff(), that
      calculates energy impact of adding, removing, and migrating utilization
      in the system.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Extend sched_group_energy to test load-balancing decisions · 641f3965
      Morten Rasmussen authored
      
      
      Extended sched_group_energy() to support energy prediction with usage
      (tasks) added/removed from a specific cpu or migrated between a pair of
      cpus. Useful for load-balancing decision making.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Calculate energy consumption of sched_group · 02f3f6bc
      Morten Rasmussen authored
      
      
      For energy-aware load-balancing decisions it is necessary to know the
      energy consumption estimates of groups of cpus. This patch introduces a
      basic function, sched_group_energy(), which estimates the energy
      consumption of the cpus in the group and any resources shared by the
      members of the group.
      
      NOTE: The function has five levels of indentation and breaks the 80
      character limit. Refactoring is necessary.
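      
      Conceptually the estimate sums, for each group, busy power weighted by
      the group's utilization at the current capacity state plus idle power
      for the remaining time; a simplified sketch (group_util() and the
      index parameters are assumptions, not the actual implementation):
      
        static unsigned long group_energy(struct sched_group *sg,
                                          int cap_idx, int idle_idx)
        {
                const struct sched_group_energy *sge = sg->sge;
                unsigned long util = group_util(sg);
                unsigned long cap  = sge->cap_states[cap_idx].cap;
                unsigned long busy_energy, idle_energy;
      
                busy_energy = sge->cap_states[cap_idx].power * util / cap;
                idle_energy = sge->idle_states[idle_idx].power *
                              (cap - util) / cap;
      
                return busy_energy + idle_energy;
        }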
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Highest energy aware balancing sched_domain level pointer · 8bb479f6
      Morten Rasmussen authored
      
      
      Add another member to the family of per-cpu sched_domain shortcut
      pointers. This one, sd_ea, points to the highest level at which energy
      model is provided. At this level and all levels below all sched_groups
      have energy model data attached.
      
      Partial energy model information is possible but restricted to providing
      energy model data for lower level sched_domains (sd_ea and below) and
      leaving load-balancing on levels above to non-energy-aware
      load-balancing. For example, it is possible to apply energy-aware
      scheduling within each socket on a multi-socket system and let normal
      scheduling handle load-balancing between sockets.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Relocated cpu_util() and change return type · 557f2f5b
      Morten Rasmussen authored
      
      
      Move cpu_util() to an earlier position in fair.c and change return
      type to unsigned long as negative usage doesn't make much sense. All
      other load and capacity related functions use unsigned long including
      the caller of cpu_util().
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability · c6952a4d
      Dietmar Eggemann authored
      
      
      For Energy-Aware Scheduling (EAS) to work properly, even in the
      case that there is only one cpu per cluster or that cpus are hot-plugged
      out, the Energy Model (EM) data on all energy-aware sched domains (sd)
      has to be present for all online cpus.
      
      Mainline sd hierarchy setup code will remove sd's which are not useful
      for task scheduling e.g. in the following situations:
      
      1. Only 1 cpu is/remains in one cluster of a multi cluster system.
      
         This remaining cpu only has DIE and no MC sd.
      
      2. A complete cluster in a two cluster system is hot-plugged out.
      
         The cpus of the remaining cluster only have MC and no DIE sd.
      
      To make sure that all online cpus keep all their energy-aware sd's,
      the sd degenerate functionality has been changed to not free a sd if
      its first sched group (sg) contains EM data in case:
      
      1. There is only 1 cpu left in the sd.
      
      2. The sd has only 1 sg although certain sd flags are set which would
         require at least 2 sg's.
      
      Instead of freeing such a sd it now clears only its SD_LOAD_BALANCE
      flag. This will make sure that the EAS functionality will always see
      all energy-aware sd's for all online cpus.
      
      It will introduce a tiny performance degradation for operations on
      affected cpus since the hot-path macro for_each_domain() has to deal
      with sd's not contributing to task scheduling at all now.
      
      In most cases the existing code makes sure that task scheduling is not
      invoked on a sd with !SD_LOAD_BALANCE.
      
      However, a small change is necessary in update_sd_lb_stats() to make
      sure that sd->parent is only initialized to !NULL in case the parent sd
      contains more than 1 sg.
      
      The handling of newidle decay values before the SD_LOAD_BALANCE check in
      rebalance_domains() stays unchanged.
      
      Test (w/ CONFIG_SCHED_DEBUG):
      
      JUNO r0 default system:
      
      $ cat /proc/cpuinfo | grep "^CPU part"
      CPU part        : 0xd03
      CPU part        : 0xd07
      CPU part        : 0xd07
      CPU part        : 0xd03
      CPU part        : 0xd03
      CPU part        : 0xd03
      
      SD names and flags:
      
      $ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
      MC
      DIE
      MC
      DIE
      MC
      DIE
      MC
      DIE
      MC
      DIE
      MC
      DIE
      
      $ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
      832f
      102f
      832f
      102f
      832f
      102f
      832f
      102f
      832f
      102f
      832f
      102f
      
      Test 1: Hotplug-out one A57 (CPU part 0xd07) cpu:
      
      $ echo 0 > /sys/devices/system/cpu/cpu1/online
      
      $ cat /proc/cpuinfo | grep "^CPU part"
      CPU part        : 0xd03
      CPU part        : 0xd07
      CPU part        : 0xd03
      CPU part        : 0xd03
      CPU part        : 0xd03
      
      SD names and flags for remaining A57 (cpu2) cpu:
      
      $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
      MC
      DIE
      
      $ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags`
      832e <-- MC SD with !SD_LOAD_BALANCE
      102f
      
      Test 2: Hotplug-out the entire A57 cluster:
      
      $ echo 0 > /sys/devices/system/cpu/cpu1/online
      $ echo 0 > /sys/devices/system/cpu/cpu2/online
      
      $ cat /proc/cpuinfo | grep "^CPU part"
      CPU part        : 0xd03
      CPU part        : 0xd03
      CPU part        : 0xd03
      CPU part        : 0xd03
      
      SD names and flags for the remaining A53 (CPU part 0xd03) cluster:
      
      $ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
      MC
      DIE
      MC
      DIE
      MC
      DIE
      MC
      DIE
      
      $ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
      832f
      102e <-- DIE SD with !SD_LOAD_BALANCE
      832f
      102e
      832f
      102e
      832f
      102e
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce SD_SHARE_CAP_STATES sched_domain flag · 85800ef2
      Morten Rasmussen authored
      
      
      cpufreq is currently keeping it a secret which cpus are sharing a
      clock source. The scheduler needs to know about clock domains as well
      to become more energy aware. The SD_SHARE_CAP_STATES domain flag
      indicates whether cpus belonging to the sched_domain share capacity
      states (P-states).
      
      There is no connection with cpufreq (yet). The flag must be set by
      the arch specific topology code.
      
      cc: Russell King <linux@arm.linux.org.uk>
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Initialize energy data structures · fdd5d497
      Dietmar Eggemann authored
      
      
      The sched_group_energy (sge) pointer of the first sched_group (sg) in
      the sched_domain (sd) is initialized to point to the appropriate (in
      terms of sd level and cpu) sge data defined in the arch and so to the
      correct part of the Energy Model (EM).
      
      Energy-aware scheduling allows that a system has only EM data up to a
      certain sd level (so called highest energy aware balancing sd level).
      A check in init_sched_energy() enforces that all sd's below this sd
      level contain EM data.
      
      The 'int cpu' parameter of sched_domain_energy_f requires that
      check_sched_energy_data() makes sure that all cpus spanned by a sg
      are provisioned with the same EM data.
      
      This patch has also been tested with feature FORCE_SD_OVERLAP enabled.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce energy data structures · 2b1bacc1
      Dietmar Eggemann authored
      
      
      The struct sched_group_energy represents the per sched_group related
      data which is needed for energy aware scheduling. It contains:
      
        (1) number of elements of the idle state array
        (2) pointer to the idle state array which comprises 'power consumption'
            for each idle state
        (3) number of elements of the capacity state array
        (4) pointer to the capacity state array which comprises 'compute
            capacity and power consumption' tuples for each capacity state
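      
      A sketch of the structures as described above (layout illustrative):
      
        struct idle_state {
                unsigned long power;    /* power consumption in this
                                           idle state */
        };
      
        struct capacity_state {
                unsigned long cap;      /* compute capacity */
                unsigned long power;    /* power consumption at this
                                           capacity state */
        };
      
        struct sched_group_energy {
                unsigned int nr_idle_states;                    /* (1) */
                const struct idle_state *idle_states;           /* (2) */
                unsigned int nr_cap_states;                     /* (3) */
                const struct capacity_state *cap_states;        /* (4) */
        };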
      
      The struct sched_group obtains a pointer to a struct sched_group_energy.
      
      The function pointer sched_domain_energy_f is introduced into struct
      sched_domain_topology_level which will allow the arch to pass a particular
      struct sched_group_energy from the topology shim layer into the scheduler
      core.
      
      The function pointer sched_domain_energy_f has an 'int cpu' parameter
      since the folding of two adjacent sd levels via sd degenerate doesn't work
      for all sd levels. I.e. it is not possible for example to use this feature
      to provide per-cpu energy in sd level DIE on ARM's TC2 platform.
      
      It was discussed that the folding of sd levels approach is preferable
      over the cpu parameter approach, simply because the user (the arch
      specifying the sd topology table) can introduce fewer errors. But since
      it is not working, the 'int cpu' parameter is the only way out. It's
      possible to use the folding of sd levels approach for
      sched_domain_flags_f and the cpu parameter approach for the
      sched_domain_energy_f at the same time though. With the use of the
      'int cpu' parameter, an extra check function has to be provided to make
      sure that all cpus spanned by a sched group are provisioned with the same
      energy data.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Make energy awareness a sched feature · cd1cb1ff
      Morten Rasmussen authored
      
      
      This patch introduces the ENERGY_AWARE sched feature, which is
      implemented using jump labels when SCHED_DEBUG is defined. It is
      statically set to false when SCHED_DEBUG is not defined. Hence this doesn't
      allow energy awareness to be enabled without SCHED_DEBUG. This
      sched_feature knob will be replaced later with a more appropriate
      control knob when things have matured a bit.
      
      ENERGY_AWARE is based on per-entity load-tracking, hence FAIR_GROUP_SCHED
      must be enabled. This dependency isn't checked at compile time yet.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Documentation for scheduler energy cost model · dcbf44c5
      Morten Rasmussen authored
      
      
      This documentation patch provides an overview of the experimental
      scheduler energy costing model, associated data structures, and a
      reference recipe on how platforms can be characterized to derive energy
      models.
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched/fair: Update blocked load from newly idle balance · 3f8b1be4
      Brendan Jackman authored
      
      
      We now have a NOHZ kick to avoid the load of idle CPUs becoming stale. This
      is good, but it brings about CPU wakeups, which have an energy cost. As an
      alternative to waking CPUs up to decay blocked load, we can sometimes do it
      from the newly idle balance. If the newly idle balance is on a domain that
      covers all the currently nohz-idle CPUs, we push the value of
      nohz.next_update into the future. That means that if such newly idle
      balances happen often enough, we never need to wake up a CPU just to update
      load.
      
      Since we're doing this new update inside a for_each_domain() loop, we need to do
      something to avoid doing multiple updates on the same CPU in the same
      idle_balance. A tick stamp is set on the rq in update_blocked_averages as a
      simple way to do this. Using a simple jiffies-based timestamp, as opposed to the
      last_update_time of the root cfs_rq's sched_avg, means we can do this without
      taking the rq lock.
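      
      A sketch of the stamp and its check (the rq field name is an
      assumption):
      
        /* In update_blocked_averages(): remember when the blocked loads of
         * this rq were last decayed. */
        rq->last_blocked_load_update_tick = jiffies;
      
        /* From the newly idle balance path: skip the update if it already
         * ran on this cpu during the current jiffy. */
        if (rq->last_blocked_load_update_tick != jiffies)
                update_blocked_averages(cpu_of(rq));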
      
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
    • sched: force update of blocked load of idle cpus · a4b7e38c
      Vincent Guittot authored
      
      
      When idle, the blocked load of CPUs will be updated only when an idle
      load balance is triggered which may never happen. Because of this
      uncertainty on the execution of idle load balance, the utilization,
      the load and the shares of idle cfs_rqs can stay artificially high and
      steal shares and running time from busy cfs_rqs of the task group.
      Add a new light idle load balance state which ensures that blocked loads
      are periodically updated and decayed but does not perform any task
      migration.
      
      The remote load updates are rate-limited, so that they are not
      performed with a shorter period than LOAD_AVG_PERIOD (i.e. PELT
      half-life). This is the period after which we have a known 50% error
      in stale load.
      
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      [Switched remote update interval to use PELT half life]
      [Moved update_blocked_averages call outside rebalance_domains
       to simplify code]
      Signed-off-by: default avatarBrendan Jackman <brendan.jackman@arm.com>
    • arm64: Enable dynamic sched_domain flag setting · 6ce2eef5
      Morten Rasmussen authored
      
      
      The patch lets the arch_topology driver take over setting of
      sched_domain flags that should be detected dynamically based on the
      actual system topology.
      
      cc: Catalin Marinas <catalin.marinas@arm.com>
      cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>