  Nov 23, 2017
    • sched/events: Introduce task_group load tracking trace event · 6d869c5b
      Dietmar Eggemann authored
      
      
      The trace event key load is mapped to:
      
       (1) load : cfs_rq->tg->load_avg
      
      The cfs_rq owned by the task_group is used as the only parameter for the
      trace event because it has a reference to both the task_group and the
      cpu. Using the task_group as a parameter instead would require the cpu
      as a second parameter, since a task_group is global rather than per-cpu
      data. The cpu key only indicates on which cpu the value was gathered.
      
      The following list shows examples of the key=value pairs for:
      
       (1) a task group:
      
           cpu=1 path=/tg1/tg11/tg111 load=517
      
       (2) an autogroup:
      
           cpu=1 path=/autogroup-10 load=1050
      
      We don't maintain a load signal for a root task group.
      
      The trace event is only defined if cfs group scheduling support
      (CONFIG_FAIR_GROUP_SCHED) is enabled.
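      
      A minimal sketch of how such a trace event definition could look,
      assuming the event is named sched_load_tg and reusing the
      __trace_sched_path()/__trace_sched_cpu() helpers described in this
      patch-stack (exact signatures are assumptions, not the actual hunk):
      
        TRACE_EVENT(sched_load_tg,
      
                TP_PROTO(struct cfs_rq *cfs_rq),
      
                TP_ARGS(cfs_rq),
      
                TP_STRUCT__entry(
                        __field(        int,    cpu     )
                        __dynamic_array(char,   path,
                                __trace_sched_path(cfs_rq, NULL, 0))
                        __field(        long,   load    )
                ),
      
                TP_fast_assign(
                        __entry->cpu = __trace_sched_cpu(cfs_rq);
                        __trace_sched_path(cfs_rq, __get_dynamic_array(path),
                                           __get_dynamic_array_len(path));
                        __entry->load = atomic_long_read(&cfs_rq->tg->load_avg);
                ),
      
                TP_printk("cpu=%d path=%s load=%ld",
                          __entry->cpu, __get_str(path), __entry->load)
        );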
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
    • sched/events: Introduce sched_entity load tracking trace event · 6d86076a
      Dietmar Eggemann authored
      
      
      The following trace event keys are mapped to:
      
       (1) load     : se->avg.load_avg
      
       (2) rbl_load : se->avg.runnable_load_avg
      
       (3) util     : se->avg.util_avg
      
      To let this trace event work for configurations w/ and w/o group
      scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following
      special handling is necessary for non-existent key=value pairs:
      
       path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED or the
                         sched_entity represents a task.
      
       comm = "(null)" : In case sched_entity represents a task_group.
      
       pid = -1        : In case sched_entity represents a task_group.
      
      The following list shows examples of the key=value pairs in different
      configurations for:
      
       (1) a task:
      
           cpu=0 path=(null) comm=sshd pid=2206 load=102 rbl_load=102 util=102
      
       (2) a taskgroup:
      
           cpu=1 path=/tg1/tg11/tg111 comm=(null) pid=-1 load=882 rbl_load=882 util=510
      
       (3) an autogroup:
      
           cpu=0 path=/autogroup-13 comm=(null) pid=-1 load=49 rbl_load=49 util=48
      
       (4) w/o CONFIG_FAIR_GROUP_SCHED:
      
           cpu=0 path=(null) comm=sshd pid=2211 load=301 rbl_load=301 util=265
      
      The trace event is only defined for CONFIG_SMP.
      
      The helper functions __trace_sched_cpu(), __trace_sched_path() and
      __trace_sched_id() are extended to deal with sched_entities as well.
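      
      Assuming the event lands under the sched subsystem in tracefs (the
      event name sched_load_se below is an assumption), it can be captured
      like any other trace event:
      
        $ echo 1 > /sys/kernel/debug/tracing/events/sched/sched_load_se/enable
        $ cat /sys/kernel/debug/tracing/trace_pipe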
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
    • sched/events: Introduce cfs_rq load tracking trace event · c4322f76
      Dietmar Eggemann authored
      
      
      The following trace event keys are mapped to:
      
       (1) load     : cfs_rq->avg.load_avg
      
       (2) rbl_load : cfs_rq->avg.runnable_load_avg
      
       (3) util     : cfs_rq->avg.util_avg
      
      To let this trace event work for configurations w/ and w/o group
      scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following
      special handling is necessary for a non-existent key=value pair:
      
       path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED.
      
      The following list shows examples of the key=value pairs in different
      configurations for:
      
       (1) a root task_group:
      
           cpu=4 path=/ load=6 rbl_load=6 util=331
      
       (2) a task_group:
      
           cpu=1 path=/tg1/tg11/tg111 load=538 rbl_load=538 util=522
      
       (3) an autogroup:
      
           cpu=3 path=/autogroup-18 load=997 rbl_load=997 util=517
      
       (4) w/o CONFIG_FAIR_GROUP_SCHED:
      
           cpu=0 path=(null) load=314 rbl_load=314 util=289
      
      The trace event is only defined for CONFIG_SMP.
      
      The helper function __trace_sched_path() can be used to get the length
      parameter of the dynamic array (path == NULL) and to copy the path into
      it (path != NULL).
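      
      A sketch of the resulting two-pass call pattern inside the event
      definition (names as described above; the exact prototype is an
      assumption):
      
        /* 1st pass: path == NULL, only the required buffer length is
         * returned, used to size the dynamic array. */
        __dynamic_array(char, path, __trace_sched_path(cfs_rq, NULL, 0))
      
        /* 2nd pass: copy the path into the dynamic array. */
        __trace_sched_path(cfs_rq, __get_dynamic_array(path),
                           __get_dynamic_array_len(path));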
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
    • sched/autogroup: Define autogroup_path() for !CONFIG_SCHED_DEBUG · 08ae6215
      Dietmar Eggemann authored
      
      
      Define autogroup_path() even in the !CONFIG_SCHED_DEBUG case. If
      CONFIG_SCHED_AUTOGROUP is enabled the path of an autogroup has to be
      available to be printed in the load tracking trace events provided by
      this patch-stack regardless of whether CONFIG_SCHED_DEBUG is set or not.
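      
      For reference, a sketch of the helper along the lines of its
      CONFIG_SCHED_DEBUG variant (illustrative, not the exact patch hunk):
      
        int autogroup_path(struct task_group *tg, char *buf, int buflen)
        {
                if (!task_group_is_autogroup(tg))
                        return 0;
      
                return snprintf(buf, buflen, "%s-%ld", "/autogroup",
                                tg->autogroup->id);
        }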
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Add energy procfs interface · 4e6424c7
      Dietmar Eggemann authored
      
      
      This patch makes the energy data available via procfs. The related files
      are placed in a sub-directory named 'energy' inside the
      /proc/sys/kernel/sched_domain/cpuX/domainY/groupZ directory for those
      cpu/domain/group tuples which have energy information.
      
      The following example depicts the contents of
      /proc/sys/kernel/sched_domain/cpu0/domain0/group[01] for a system which
      has energy information attached to domain level 0.
      
      ├── cpu0
      │   ├── domain0
      │   │   ├── busy_factor
      │   │   ├── busy_idx
      │   │   ├── cache_nice_tries
      │   │   ├── flags
      │   │   ├── forkexec_idx
      │   │   ├── group0
      │   │   │   └── energy
      │   │   │       ├── cap_states
      │   │   │       ├── idle_states
      │   │   │       ├── nr_cap_states
      │   │   │       └── nr_idle_states
      │   │   ├── group1
      │   │   │   └── energy
      │   │   │       ├── cap_states
      │   │   │       ├── idle_states
      │   │   │       ├── nr_cap_states
      │   │   │       └── nr_idle_states
      │   │   ├── idle_idx
      │   │   ├── imbalance_pct
      │   │   ├── max_interval
      │   │   ├── max_newidle_lb_cost
      │   │   ├── min_interval
      │   │   ├── name
      │   │   ├── newidle_idx
      │   │   └── wake_idx
      │   └── domain1
      │       ├── busy_factor
      │       ├── busy_idx
      │       ├── cache_nice_tries
      │       ├── flags
      │       ├── forkexec_idx
      │       ├── idle_idx
      │       ├── imbalance_pct
      │       ├── max_interval
      │       ├── max_newidle_lb_cost
      │       ├── min_interval
      │       ├── name
      │       ├── newidle_idx
      │       └── wake_idx
      
      The files 'nr_idle_states' and 'nr_cap_states' each contain a scalar
      value, whereas 'idle_states' contains a vector of power consumption
      values (one per idle state) and 'cap_states' contains a vector of
      (compute capacity, power consumption) tuples (one per capacity state).
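      
      For example, the capacity state data of the first group can be read
      directly (paths as in the tree above):
      
        $ cd /proc/sys/kernel/sched_domain/cpu0/domain0/group0/energy
        $ cat nr_cap_states
        $ cat cap_states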
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64: use cpu scale value derived from energy model · cc65f03d
      Dietmar Eggemann authored
      
      
      To make sure that the capacity value of the last element of the capacity
      states vector of the energy model (EM) core (MC) level is equal to the
      cpu scale value, use this capacity value to overwrite the cpu scale
      value previously derived from the Cpu Invariant Engine (CIE).
      
      This patch is necessary as long as there is no complete EM support in
      device tree.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64: define hikey620 sys sd energy model · cd4a2377
      Dietmar Eggemann authored
      
      
      Hi6220 has a single frequency domain spanning the two clusters. It
      needs the SYS sched domain (sd) to let the EAS algorithm work
      properly.
      
      The SD_SHARE_CAP_STATES flag is not set on SYS sd.
      
      This lets sd_ea (highest sd w/ energy model data) point to the SYS
      sd whereas sd_scs (highest sd w/ SD_SHARE_CAP_STATES set) points to
      the DIE sd. This setup allows the code in sched_group_energy() to
      set sg_shared_cap to the single sched group of the SYS sd covering
      all the cpus in the system as they are all part of the single
      frequency domain.
      
      The capacity and idle state vectors only contain entries w/ power
      values equal to zero, so there is no system-wide energy contribution.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64: introduce sys sd energy model infrastructure · 49dc3bb7
      Dietmar Eggemann authored
      Allow the energy model to contain a system level besides the already
      existing core and cluster level.
      
      This is necessary for platforms with frequency domains spanning all
      cpus to let the EAS algorithm work properly.
      
      The whole idea of this system level has to be rethought once
      the idea of the 'struct sched_domain_shared' gets more momentum:
      
      https://lkml.org/lkml/2016/6/16/209
      
      
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64, dts: add hikey cpu capacity-dmips-mhz information · 00ee664a
      Dietmar Eggemann authored
      
      
      Hikey is an SMP platform, so this property would normally not be necessary.
      
      But since we drive the setting of the EAS specific sched domain flag
      SD_SHARE_CAP_STATES via the init_cpu_capacity_callback() cpufreq
      notifier, we have to make sure that cap_parsing_failed is not set to
      true in parse_cpu_capacity() so that init_cpu_capacity_callback() does
      not bail out before consuming the CPUFREQ_NOTIFY event. The easiest way
      to achieve this is to provide the dts file with this property.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • cd52d507
    • arm64: factor out energy model from topology shim layer · 60d66416
      Dietmar Eggemann authored
      
      
      To be able to support multiple energy models before we have the
      full-fledged dt solution in arm64 (e.g. for the Arm Juno and
      Hisilicon Hikey platforms), factor out the static energy model data and the
      appropriate access function into energy_model.h.
      
      The patch uses of_match_node() to match the compatible string with the
      appropriate platform energy model data, i.e. the patch introduces a
      dependency to CONFIG_OF_FLATTREE for propagating the energy model data
      towards the task scheduler.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64, topology: Define JUNO energy and provide it to the scheduler · 1e983941
      Juri Lelli authored
      
      
      This patch is only here to be able to test provisioning of energy related
      data from an arch topology shim layer to the scheduler. Since there is no
      code today which deals with extracting energy related data from the dtb or
      acpi, and processing it in the topology shim layer, the content of the
      sched_group_energy structures as well as the idle_state and capacity_state
      arrays are hard-coded here.
      
      This patch defines the sched_group_energy structure as well as the
      idle_state and capacity_state array for the cluster (relates to sched
      groups (sgs) in DIE sched domain level) and for the core (relates to sgs
      in MC sd level) for a Cortex A53 as well as for a Cortex A57.
      It further provides related implementations of the sched_domain_energy_f
      functions (cpu_cluster_energy() and cpu_core_energy()).
      
      To be able to propagate this information from the topology shim layer to
      the scheduler, the elements of the arm_topology[] table have been
      provisioned with the appropriate sched_domain_energy_f functions.
      
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
    • arm: use cpu scale value derived from energy model · 2ba06bfb
      Dietmar Eggemann authored
      
      
      To make sure that the capacity value of the last element of the capacity
      states vector of the energy model (EM) core (MC) level is equal to the
      cpu scale value, use this capacity value to overwrite the cpu scale
      value previously derived from the Cpu Invariant Engine (CIE).
      
      This patch is necessary as long as there is no complete EM support in
      device tree.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm: topology: Define TC2 energy and provide it to the scheduler · 7a35cdc0
      Dietmar Eggemann authored
      
      
      This patch is only here to be able to test provisioning of energy related
      data from an arch topology shim layer to the scheduler. Since there is no
      code today which deals with extracting energy related data from the dtb or
      acpi, and processing it in the topology shim layer, the content of the
      sched_group_energy structures as well as the idle_state and capacity_state
      arrays are hard-coded here.
      
      This patch defines the sched_group_energy structure as well as the
      idle_state and capacity_state array for the cluster (relates to sched
      groups (sgs) in DIE sched domain level) and for the core (relates to sgs
      in MC sd level) for a Cortex A7 as well as for a Cortex A15.
      It further provides related implementations of the sched_domain_energy_f
      functions (cpu_cluster_energy() and cpu_core_energy()).
      
      To be able to propagate this information from the topology shim layer to
      the scheduler, the elements of the arm_topology[] table have been
      provisioned with the appropriate sched_domain_energy_f functions.
      
      cc: Russell King <linux@arm.linux.org.uk>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm64, arm: Tweak defconfig and multi_v7_defconfig for EAS integration · aedb6d5c
      Dietmar Eggemann authored
      
      
      for arm64 and arm:
      
      Add    CpuFreq governors and make schedutil default
      Add    Sched Debug
      Add    Ftrace
      Add    Function, Function Graph, Irqsoff, Preempt, Sched Tracer
      Add    Prove Locking
      
      for arm64:
      
      Add    Generic DT based CpuFreq driver - for hikey
      Add    USB Net AX8817X                 - for hikey
      Add    USB Net RTL8152                 - for hikey
      Add    HI6220 stub clock               - for hikey
      
      for arm:
      
      Add    Kernel .config support and /proc/config.gz
      Add    DIE sched domain level
      Add    Scheduler autogroups
      Add    ARM Big.Little cpufreq driver - for TC2
      Add    ARM Big.Little cpuidle driver - for TC2
      Add    Sensor Vexpress               - for TC2
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • drivers base/arch_topology: Detect SD_SHARE_CAP_STATES flag · 3ffc4719
      Morten Rasmussen authored
      
      
      Detect and set the SD_SHARE_CAP_STATES sched_domain flag automatically
      based on the cpufreq policy related_cpus mask. Since the sched_domain
      flags functions don't take any parameters we have to assume that the
      flags are the same for all sched_domains at the same level, i.e.
      platforms mixing per-core and per-cluster DVFS are not supported.
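      
      A sketch of how such a topology flags function could look, assuming a
      helper that exposes the cpumask derived from the cpufreq policy's
      related_cpus (freq_domain_mask() is a hypothetical name):
      
        static int arch_core_flags(int cpu)
        {
                int flags = 0;
      
                /* All cpus at this topology level share P-states iff they
                 * all sit in the same frequency domain. */
                if (cpumask_subset(topology_core_cpumask(cpu),
                                   freq_domain_mask(cpu)))
                        flags |= SD_SHARE_CAP_STATES;
      
                return flags;
        }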
      
      cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • drivers base/arch_topology: enforce SCHED_CAPACITY_SCALE as highest CPU capacity · 978bb184
      Dietmar Eggemann authored
      
      
      The default CPU capacity is SCHED_CAPACITY_SCALE (1024).
      
      On a heterogeneous system (hmp) this value can be smaller for some cpus.
      The CPU capacity parsing code normalizes the capacity-dmips-mhz
      properties w.r.t. the highest value found while parsing the DT to
      SCHED_CAPACITY_SCALE.
      
      CPU capacity can also be changed by writing to
      /sys/devices/system/cpu/cpu*/cpu_capacity.
      
      To make sure that a subset of all online cpus still has a CPU capacity
      value of SCHED_CAPACITY_SCALE, enforce this in the appropriate sysfs
      attribute store function, cpu_capacity_store().
      
      This avoids weird setups like transforming an hmp system into an smp
      system with a CPU capacity < SCHED_CAPACITY_SCALE for all cpus.
      
      The current cpu_capacity_store() assumes that all cpus of a cluster have
      the same CPU capacity value which is true for existing hmp systems (e.g.
      big.LITTLE). This assumption is also used by this patch.
      If the new CPU capacity value for a cpu is smaller than
      SCHED_CAPACITY_SCALE we iterate over the cpus which do not belong to the
      cpu's cluster and check that there is still a cpu with a CPU capacity
      equal to SCHED_CAPACITY_SCALE.
      
      The use of &cpu_topology[this_cpu].core_sibling is replaced by
      topology_core_cpumask(this_cpu).
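      
      A sketch of the added check (variable and helper names assumed):
      
        /* Lowering this cluster's capacity below SCHED_CAPACITY_SCALE is
         * only allowed if a cpu outside the cluster keeps the maximum. */
        if (new_capacity < SCHED_CAPACITY_SCALE) {
                bool max_capacity_found = false;
                int i;
      
                for_each_cpu_not(i, topology_core_cpumask(this_cpu)) {
                        if (per_cpu(cpu_scale, i) == SCHED_CAPACITY_SCALE) {
                                max_capacity_found = true;
                                break;
                        }
                }
      
                if (!max_capacity_found)
                        return -EINVAL;
        }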
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • drivers base/arch_topology: fold two pr_debug()'s into one · 978079a9
      Dietmar Eggemann authored
      
      
      Output cpu_capacity and raw_capacity in one pr_debug() call instead
      of two.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • Per Sched domain over utilization · 2e28b34e
      Thara Gopinath authored
      
      
      The current implementation of overutilization aborts energy aware
      scheduling if any cpu in the system is over-utilized. This patch
      introduces an over-utilization flag per sched domain level instead of a
      single system-wide flag. Load balancing is done at any sched domain
      where one of the cpus is over-utilized. If energy aware scheduling is
      enabled and no cpu in a sched domain is over-utilized, load balancing
      is skipped for that sched domain and energy aware scheduling continues
      at that level.
      
      The implementation takes advantage of the shared sched_domain structure
      that is common across all the sched domains at a level. The new flag is
      placed in this structure so that all the sched domains at the same
      level share the flag. In case of an over-utilized cpu, the flag gets
      set at the level-1 sched_domain. The flag at the parent sched_domain
      level gets set in either of the two following scenarios:
       1. There is a misfit task in one of the cpus in this sched_domain.
       2. The total utilization of the domain is greater than the domain
          capacity.
      
      The flag is cleared if no cpu in a sched domain is over-utilized.
      
      This implementation can still have corner scenarios with respect to
      misfit tasks. For example, consider a sched group with n cpus and
      n+1 70%-utilized tasks. Ideally this is a case for a load balance to
      happen in a parent sched domain. But neither is the total group
      utilization high enough for the load balance to be triggered in the
      parent domain, nor is there a cpu with a single over-utilized task so
      that a load balance is triggered in a parent domain. Then again, this
      could be a purely academic scenario, as during task wake-up these tasks
      will be placed more appropriately.
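      
      A sketch of the flag placement in the shared structure and a helper to
      query it (field and helper names assumed):
      
        struct sched_domain_shared {
                atomic_t        ref;
                atomic_t        nr_busy_cpus;
                int             has_idle_cores;
                int             overutilized;   /* new: shared per sd level */
        };
      
        static inline bool sd_overutilized(struct sched_domain *sd)
        {
                return READ_ONCE(sd->shared->overutilized);
        }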
      
      Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
    • sched: Disable energy-unfriendly nohz kicks · e855790b
      Morten Rasmussen authored
      
      
      With energy-aware scheduling enabled nohz_kick_needed() generates many
      nohz idle-balance kicks which lead to nothing when multiple tasks get
      packed on a single cpu to save energy. This causes unnecessary wake-ups
      and hence wastes energy. Make these conditions depend on !energy_aware()
      for now until the energy-aware nohz story gets sorted out.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Consider a not over-utilized energy-aware system as balanced · cbd0e38b
      Dietmar Eggemann authored
      
      
      In case the system operates below the tipping point indicator,
      introduced in ("sched: Add over-utilization/tipping point
      indicator"), bail out in find_busiest_group after the dst and src
      group statistics have been checked.
      
      There is simply no need to move usage around because all involved
      cpus still have spare cycles available.
      
      For an energy-aware system below its tipping point, we rely on the
      task placement of the wakeup path. This works well for short running
      tasks.
      
      The existence of long running tasks on one of the involved cpus lets
      the system operate over its tipping point. To be able to move such
      a task (whose load can't be used to average the load among the cpus)
      from a src cpu with lower capacity than the dst_cpu, an additional
      rule has to be implemented in need_active_balance.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched/fair: Energy-aware wake-up task placement · 21760286
      Morten Rasmussen authored
      
      
      When the system is not over-utilized, place waking tasks on the most
      energy efficient cpu. Previous attempts reduced the search space by
      matching task utilization to cpu capacity before consulting the energy
      model, as this is an expensive operation. The search heuristics didn't
      work very well and, lacking any better alternatives, this patch takes
      the brute-force route and tries all potential targets.
      
      This approach doesn't scale, but it might be sufficient for many
      embedded applications while work is continuing on a heuristic that can
      minimize the necessary computations. The heuristic must be derived from
      the platform energy model rather than making additional assumptions,
      such as that lower capacity implies better energy efficiency. PeterZ
      mentioned in the past that we might be able to derive some simpler
      deciding functions using mathematical (modal?) analysis.
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Add over-utilization/tipping point indicator · 31453599
      Morten Rasmussen authored
      
      
      Energy-aware scheduling is only meant to be active while the system is
      _not_ over-utilized. That is, there are spare cycles available to shift
      tasks around based on their actual utilization to get a more
      energy-efficient task distribution without depriving any tasks. When
      above the tipping point task placement is done the traditional way based
      on load_avg, spreading the tasks across as many cpus as possible based
      on priority scaled load to preserve smp_nice. Below the tipping point we
      want to use util_avg instead. We need to define a criteria for when we
      make the switch.
      
      The util_avg for each cpu converges towards 100% (1024) regardless of
      how many additional tasks we may put on it. If we define
      over-utilized as:
      
      sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)
      
      some individual cpus may be over-utilized running multiple tasks even
      when the above condition is false. That should be okay as long as we try
      to spread the tasks out to avoid per-cpu over-utilization as much as
      possible and if all tasks have the _same_ priority. If the latter isn't
      true, we have to consider priority to preserve smp_nice.
      
      For example, we could have n_cpus nice=-10 util_avg=55% tasks and
      n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
      likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
      getting their own, as we have 1.5*n_cpus tasks in total and 55%+55% is less
      over-utilized than 55%+60% for those cpus that have to be shared. The
      system utilization is only 85% of the system capacity, but we are
      breaking smp_nice.
      
      To be sure not to break smp_nice, we have defined over-utilization
      conservatively as when any cpu in the system is fully utilized at its
      highest frequency instead:
      
      cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity
      
      IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
      to factor in priority to preserve smp_nice.
      
      With this definition, we can skip periodic load-balance as no cpu has an
      always-running task when the system is not over-utilized. All tasks will
      be periodic and we can balance them at wake-up. This conservative
      condition does however mean that some scenarios that could benefit from
      energy-aware decisions even if one cpu is fully utilized would not get
      those benefits.
      
      For systems where some cpus might have reduced capacity (RT-pressure
      and/or big.LITTLE), we want periodic load-balance checks as soon as just
      a single cpu is fully utilized, as it might be one of those with reduced
      capacity, and in that case we want to migrate it.
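      
      A sketch of the per-cpu condition (capacity_margin as an assumed
      tuning value, e.g. 1280 for a ~20% margin on a 1024 scale):
      
        static bool cpu_overutilized(int cpu)
        {
                /* util * margin > capacity * 1024, i.e. utilization plus
                 * a margin exceeds the cpu's capacity. */
                return (capacity_of(cpu) * 1024) <
                       (cpu_util(cpu) * capacity_margin);
        }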
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched/fair: Add energy_diff dead-zone margin · 75d2c160
      Morten Rasmussen authored
      It is not worth the overhead to migrate tasks for tiny insignificant
      energy savings. To prevent this, an energy margin is introduced in
      energy_diff() which effectively adds a dead-zone that rounds tiny energy
      differences to zero. Since no scale is enforced for energy model data
      the margin can't be absolute. Instead it is defined as +/-1.56% energy
      saving compared to the current total estimated energy consumption.
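      
      +/-1.56% is 1/64 of the current total estimated energy, which keeps the
      margin computation to a shift; a sketch (names assumed):
      
        /* Round energy differences within +/- total/64 (~1.56%) to zero. */
        static inline int normalize_energy(int energy_diff, int total_energy)
        {
                int margin = total_energy >> 6;         /* 1/64 ~= 1.56% */
      
                if (energy_diff > -margin && energy_diff < margin)
                        return 0;
      
                return energy_diff;
        }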
    • sched: Determine the current sched_group idle-state · a14c0a78
      Dietmar Eggemann authored
      
      
      To estimate the energy consumption of a sched_group in
      sched_group_energy() it is necessary to know which idle-state the group
      is in when it is idle. For now, it is assumed that this is the current
      idle-state (though it might be wrong). Based on the individual cpu
      idle-states group_idle_state() finds the group idle-state.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched, cpuidle: Track cpuidle state index in the scheduler · 729ee534
      Morten Rasmussen authored
      
      
      The idle-state of each cpu is currently pointed to by rq->idle_state but
      there isn't any information in the struct cpuidle_state that can be used
      to look up the idle-state energy model data stored in struct
      sched_group_energy. For this purpose it is necessary to store the idle
      state index as well. Ideally, the idle-state data should be unified.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Estimate energy impact of scheduling decisions · a8e979ab
      Morten Rasmussen authored
      
      
      Adds a generic energy-aware helper function, energy_diff(), that
      calculates energy impact of adding, removing, and migrating utilization
      in the system.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Extend sched_group_energy to test load-balancing decisions · 641f3965
      Morten Rasmussen authored
      
      
      Extended sched_group_energy() to support energy prediction with usage
      (tasks) added/removed from a specific cpu or migrated between a pair of
      cpus. Useful for load-balancing decision making.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Calculate energy consumption of sched_group · 02f3f6bc
      Morten Rasmussen authored
      
      
      For energy-aware load-balancing decisions it is necessary to know the
      energy consumption estimates of groups of cpus. This patch introduces a
      basic function, sched_group_energy(), which estimates the energy
      consumption of the cpus in the group and any resources shared by the
      members of the group.
      
      NOTE: The function has five levels of indentation and breaks the 80
      character limit. Refactoring is necessary.
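      
      Conceptually the estimate sums, for each group, busy power weighted by
      the group's utilization at the current capacity state plus idle power
      for the remaining time; a simplified sketch (group_util() and the
      index parameters are assumptions, not the actual implementation):
      
        static unsigned long group_energy(struct sched_group *sg,
                                          int cap_idx, int idle_idx)
        {
                const struct sched_group_energy *sge = sg->sge;
                unsigned long util = group_util(sg);
                unsigned long cap  = sge->cap_states[cap_idx].cap;
                unsigned long busy_energy, idle_energy;
      
                busy_energy = sge->cap_states[cap_idx].power * util / cap;
                idle_energy = sge->idle_states[idle_idx].power *
                              (cap - util) / cap;
      
                return busy_energy + idle_energy;
        }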
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Highest energy aware balancing sched_domain level pointer · 8bb479f6
      Morten Rasmussen authored
      
      
      Add another member to the family of per-cpu sched_domain shortcut
      pointers. This one, sd_ea, points to the highest level at which energy
      model is provided. At this level and all levels below all sched_groups
      have energy model data attached.
      
      Partial energy model information is possible but restricted to providing
      energy model data for lower level sched_domains (sd_ea and below) and
      leaving load-balancing on levels above to non-energy-aware
      load-balancing. For example, it is possible to apply energy-aware
      scheduling within each socket on a multi-socket system and let normal
      scheduling handle load-balancing between sockets.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Relocated cpu_util() and change return type · 557f2f5b
      Morten Rasmussen authored
      
      
      Move cpu_util() to an earlier position in fair.c and change return
      type to unsigned long as negative usage doesn't make much sense. All
      other load and capacity related functions use unsigned long including
      the caller of cpu_util().
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability · c6952a4d
      Dietmar Eggemann authored
      
      
      For Energy-Aware Scheduling (EAS) to work properly, even in the
      case that there is only one cpu per cluster or that cpus are hot-plugged
      out, the Energy Model (EM) data on all energy-aware sched domains (sd)
      has to be present for all online cpus.
      
      Mainline sd hierarchy setup code will remove sd's which are not useful
      for task scheduling e.g. in the following situations:
      
      1. Only 1 cpu is/remains in one cluster of a multi cluster system.
      
         This remaining cpu only has DIE and no MC sd.
      
      2. A complete cluster in a two cluster system is hot-plugged out.
      
         The cpus of the remaining cluster only have MC and no DIE sd.
      
      To make sure that all online cpus keep all their energy-aware sd's,
      the sd degenerate functionality has been changed to not free a sd if
      its first sched group (sg) contains EM data in case:
      
      1. There is only 1 cpu left in the sd.
      
      2. The sd has only 1 sg although certain sd flags are set which would
         require at least 2 sg's.
      
      Instead of freeing such a sd it now clears only its SD_LOAD_BALANCE
      flag. This will make sure that the EAS functionality will always see
      all energy-aware sd's for all online cpus.
      
      It will introduce a tiny performance degradation for operations on
      affected cpus since the hot-path macro for_each_domain() has to deal
      with sd's not contributing to task scheduling at all now.
      
      In most cases the existing code makes sure that task scheduling is not
      invoked on a sd with !SD_LOAD_BALANCE.
      
      However, a small change is necessary in update_sd_lb_stats() to make
      sure that sd->parent is only initialized to !NULL in case the parent sd
      contains more than 1 sg.
      
      The handling of newidle decay values before the SD_LOAD_BALANCE check in
      rebalance_domains() stays unchanged.
      
      Test (w/ CONFIG_SCHED_DEBUG):
      
      JUNO r0 default system:
      
      $ cat /proc/cpuinfo | grep "^CPU part"
      CPU part        : 0xd03
      CPU part        : 0xd07
      CPU part        : 0xd07
      CPU part        : 0xd03
      CPU part        : 0xd03
      CPU part        : 0xd03
      
      SD names and flags:
      
      $ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
      MC
      DIE
      MC
      DIE
      MC
      DIE
      MC
      DIE
      MC
      DIE
      MC
      DIE
      
      $ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
      832f
      102f
      832f
      102f
      832f
      102f
      832f
      102f
      832f
      102f
      832f
      102f
      
      Test 1: Hotplug-out one A57 (CPU part 0xd07) cpu:
      
      $ echo 0 > /sys/devices/system/cpu/cpu1/online
      
      $ cat /proc/cpuinfo | grep "^CPU part"
      CPU part        : 0xd03
      CPU part        : 0xd07
      CPU part        : 0xd03
      CPU part        : 0xd03
      CPU part        : 0xd03
      
      SD names and flags for remaining A57 (cpu2) cpu:
      
      $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
      MC
      DIE
      
      $ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags`
      832e <-- MC SD with !SD_LOAD_BALANCE
      102f
      
      Test 2: Hotplug-out the entire A57 cluster:
      
      $ echo 0 > /sys/devices/system/cpu/cpu1/online
      $ echo 0 > /sys/devices/system/cpu/cpu2/online
      
      $ cat /proc/cpuinfo | grep "^CPU part"
      CPU part        : 0xd03
      CPU part        : 0xd03
      CPU part        : 0xd03
      CPU part        : 0xd03
      
      SD names and flags for the remaining A53 (CPU part 0xd03) cluster:
      
      $ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
      MC
      DIE
      MC
      DIE
      MC
      DIE
      MC
      DIE
      
      $ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
      832f
      102e <-- DIE SD with !SD_LOAD_BALANCE
      832f
      102e
      832f
      102e
      832f
      102e
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce SD_SHARE_CAP_STATES sched_domain flag · 85800ef2
      Morten Rasmussen authored
      
      
      cpufreq is currently keeping it a secret which cpus are sharing a
      clock source. The scheduler needs to know about clock domains as well
      to become more energy aware. The SD_SHARE_CAP_STATES domain flag
      indicates whether cpus belonging to the sched_domain share capacity
      states (P-states).
      
      There is no connection with cpufreq (yet). The flag must be set by
      the arch specific topology code.
      
      cc: Russell King <linux@arm.linux.org.uk>
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Initialize energy data structures · fdd5d497
      Dietmar Eggemann authored
      
      
      The sched_group_energy (sge) pointer of the first sched_group (sg) in
      the sched_domain (sd) is initialized to point to the appropriate (in
      terms of sd level and cpu) sge data defined in the arch and so to the
      correct part of the Energy Model (EM).
      
      Energy-aware scheduling allows that a system has only EM data up to a
      certain sd level (so called highest energy aware balancing sd level).
      A check in init_sched_energy() enforces that all sd's below this sd
      level contain EM data.
      
      The 'int cpu' parameter of sched_domain_energy_f requires that
      check_sched_energy_data() makes sure that all cpus spanned by a sg
      are provisioned with the same EM data.
      
      This patch has also been tested with feature FORCE_SD_OVERLAP enabled.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce energy data structures · 2b1bacc1
      Dietmar Eggemann authored
      
      
      The struct sched_group_energy represents the per sched_group related
      data which is needed for energy aware scheduling. It contains:
      
        (1) number of elements of the idle state array
        (2) pointer to the idle state array which comprises 'power consumption'
            for each idle state
        (3) number of elements of the capacity state array
        (4) pointer to the capacity state array which comprises 'compute
            capacity and power consumption' tuples for each capacity state
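      
      A sketch of the structures as described above (layout illustrative):
      
        struct idle_state {
                unsigned long power;    /* power consumption in this
                                           idle state */
        };
      
        struct capacity_state {
                unsigned long cap;      /* compute capacity */
                unsigned long power;    /* power consumption at this
                                           capacity state */
        };
      
        struct sched_group_energy {
                unsigned int nr_idle_states;                    /* (1) */
                const struct idle_state *idle_states;           /* (2) */
                unsigned int nr_cap_states;                     /* (3) */
                const struct capacity_state *cap_states;        /* (4) */
        };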
      
      The struct sched_group obtains a pointer to a struct sched_group_energy.
      
      The function pointer sched_domain_energy_f is introduced into struct
      sched_domain_topology_level which will allow the arch to pass a particular
      struct sched_group_energy from the topology shim layer into the scheduler
      core.
      
      The function pointer sched_domain_energy_f has an 'int cpu' parameter
      since the folding of two adjacent sd levels via sd degenerate doesn't work
      for all sd levels. I.e. it is not possible for example to use this feature
      to provide per-cpu energy in sd level DIE on ARM's TC2 platform.
      
      It was discussed that the folding of sd levels approach is preferable
      over the cpu parameter approach, simply because the user (the arch
      specifying the sd topology table) can introduce fewer errors. But since
      it is not working, the 'int cpu' parameter is the only way out. It's
      possible to use the folding of sd levels approach for
      sched_domain_flags_f and the cpu parameter approach for the
      sched_domain_energy_f at the same time though. With the use of the
      'int cpu' parameter, an extra check function has to be provided to make
      sure that all cpus spanned by a sched group are provisioned with the same
      energy data.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Make energy awareness a sched feature · cd1cb1ff
      Morten Rasmussen authored
      
      
      This patch introduces the ENERGY_AWARE sched feature, which is
      implemented using jump labels when SCHED_DEBUG is defined. It is
      statically set to false when SCHED_DEBUG is not defined. Hence this doesn't
      allow energy awareness to be enabled without SCHED_DEBUG. This
      sched_feature knob will be replaced later with a more appropriate
      control knob when things have matured a bit.
      
      ENERGY_AWARE is based on per-entity load-tracking, hence FAIR_GROUP_SCHED
      must be enabled. This dependency isn't checked at compile time yet.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Documentation for scheduler energy cost model · dcbf44c5
      Morten Rasmussen authored
      
      
      This documentation patch provides an overview of the experimental
      scheduler energy costing model, associated data structures, and a
      reference recipe on how platforms can be characterized to derive energy
      models.
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched/fair: Update blocked load from newly idle balance · 3f8b1be4
      Brendan Jackman authored
      
      
      We now have a NOHZ kick to avoid the load of idle CPUs becoming stale. This
      is good, but it brings about CPU wakeups, which have an energy cost. As an
      alternative to waking CPUs up to decay blocked load, we can sometimes do it
      from the newly idle balance. If the newly idle balance is on a domain that
      covers all the currently nohz-idle CPUs, we push the value of
      nohz.next_update into the future. That means that if such newly idle
      balances happen often enough, we never need to wake up a CPU just to update
      load.
      
      Since we're doing this new update inside a for_each_domain() loop, we need to do
      something to avoid doing multiple updates on the same CPU in the same
      idle_balance. A tick stamp is set on the rq in update_blocked_averages as a
      simple way to do this. Using a simple jiffies-based timestamp, as opposed to the
      last_update_time of the root cfs_rq's sched_avg, means we can do this without
      taking the rq lock.
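      
      A sketch of the stamp and its check (the rq field name is an
      assumption):
      
        /* In update_blocked_averages(): remember when the blocked loads of
         * this rq were last decayed. */
        rq->last_blocked_load_update_tick = jiffies;
      
        /* From the newly idle balance path: skip the update if it already
         * ran on this cpu during the current jiffy. */
        if (rq->last_blocked_load_update_tick != jiffies)
                update_blocked_averages(cpu_of(rq));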
      
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
    • sched: force update of blocked load of idle cpus · a4b7e38c
      Vincent Guittot authored
      
      
      When idle, the blocked load of CPUs will be updated only when an idle
      load balance is triggered which may never happen. Because of this
      uncertainty on the execution of idle load balance, the utilization,
      the load and the shares of idle cfs_rqs can stay artificially high and
      steal shares and running time from busy cfs_rqs of the task group.
      Add a new light idle load balance state which ensures that blocked loads
      are periodically updated and decayed but does not perform any task
      migration.
      
      The remote load updates are rate-limited, so that they are not
      performed with a shorter period than LOAD_AVG_PERIOD (i.e. PELT
      half-life). This is the period after which we have a known 50% error
      in stale load.
      
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      [Switched remote update interval to use PELT half life]
      [Moved update_blocked_averages call outside rebalance_domains
       to simplify code]
      Signed-off-by: default avatarBrendan Jackman <brendan.jackman@arm.com>
    • arm64: Enable dynamic sched_domain flag setting · 6ce2eef5
      Morten Rasmussen authored
      
      
      The patch lets the arch_topology driver take over setting of
      sched_domain flags that should be detected dynamically based on the
      actual system topology.
      
      cc: Catalin Marinas <catalin.marinas@arm.com>
      cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>