sched/fair: Decrease util_est in presence of idle time (815de241) · Commits · linux-arm / linux-pg

Commit 815de241 authored Dec 13, 2024 by Pierre Gondois
sched/fair: Decrease util_est in presence of idle time

The util_est signal does not decay if the task utilization is lower
than its runnable signal by more than 10 units. This was done to keep
the util_est signal high in case a task shares a rq with another
task and doesn't obtain the running time it desires.

However, tasks sharing a rq obtain the running time they desire
provided that the rq has some idle time. Indeed, either:
- a CPU is always running. The utilization signal of tasks reflects
  the running time they obtained. This running time depends on the
  niceness of the tasks. A decreasing utilization signal doesn't
  reflect a decrease of the task activity and the util_est signal
  should not be decayed in this case.
- a CPU is not always running (i.e. there is some idle time). Tasks
  might be waiting to run, increasing their runnable signal, but
  eventually run to completion. A decreasing utilization signal
  does reflect a decrease of the task activity and the util_est
  signal should be decayed in this case.
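
The two cases above can be condensed into a pair of predicates. The
following is a minimal standalone sketch, not the kernel code: the
function names, the has_idle_time flag and the UTIL_EST_MARGIN name
are illustrative assumptions, and the value of 10 is the margin quoted
in this commit message.

```c
#include <stdbool.h>

/* Margin of 10 units quoted in the commit message (assumed name). */
#define UTIL_EST_MARGIN 10

/*
 * Base behaviour: skip the util_est decay when the task utilization is
 * lower than its runnable signal by more than the margin.
 */
static bool base_skips_decay(unsigned int util, unsigned int runnable)
{
	return util + UTIL_EST_MARGIN < runnable;
}

/*
 * Patched behaviour: skip the decay only when the rq had no idle time,
 * i.e. when the CPU was always running. has_idle_time is a hypothetical
 * input; how idle time is detected is not modelled here.
 */
static bool patch_skips_decay(bool has_idle_time)
{
	return !has_idle_time;
}
```

With hypothetical values util=51 and runnable=120 on a rq that still
has idle time, the base predicate blocks the decay (51 + 10 < 120),
whereas the patched predicate lets util_est decrease.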

------------

Running on a 1024 capacity CPU:
- TaskA:
  - duty_cycle=60%, period=16ms, duration=2s
- TaskB:
  - sleep=2ms (to let TaskA run first), duration=1s
  - first phase: duty_cycle=20%, period=16ms, duration=1s
  - second phase: duty_cycle=5%, period=16ms, duration=1s

When TaskB starts the second phase, the util_est signal can take up to
~150ms before starting to decrease. Indeed, if TaskB runs after
TaskA, its runnable signal will be higher than its util signal by more
than 10 units.
This creates unnecessary frequency spikes: upon enqueuing TaskB,
util_est_cfs is increased by the old value of util_est of TaskB,
impacting frequency selection through:
sugov_get_util()
\-cpu_util_cfs_boost()
  \-cpu_util()
      util_est = READ_ONCE(cfs_rq->avg.util_est);
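
To illustrate how a stale util_est inflates the frequency request,
here is a simplified model. It is a sketch under assumptions, not the
kernel implementation: the struct and helper names are hypothetical,
the CPU utilization is assumed to be the max of util_avg and util_est,
and the frequency mapping assumes a schedutil-like 25% headroom.

```c
/* Hypothetical model of the utilization signals of a CFS runqueue. */
struct rq_model {
	unsigned long util_avg;	/* tracked utilization */
	unsigned long util_est;	/* sum of the enqueued tasks' util_est */
};

/* Enqueuing a task adds its util_est to the rq-level util_est. */
static void enqueue_task_est(struct rq_model *rq, unsigned long task_util_est)
{
	rq->util_est += task_util_est;
}

/* Assumption: CPU utilization is the max of util_avg and util_est. */
static unsigned long cpu_util_model(const struct rq_model *rq)
{
	return rq->util_est > rq->util_avg ? rq->util_est : rq->util_avg;
}

/* Assumed mapping: 25% headroom on top of util, clamped to max_freq. */
static unsigned long next_freq_model(unsigned long util,
				     unsigned long capacity,
				     unsigned long max_freq)
{
	unsigned long freq = (max_freq + (max_freq >> 2)) * util / capacity;

	return freq > max_freq ? max_freq : freq;
}
```

Assuming util_est values of roughly 614 for TaskA (60% of 1024), 205
for a stale TaskB (20%) and 51 for a decayed TaskB (5%), and a
hypothetical max_freq of 2000000 kHz, the stale sum of 819 requests
about 2.0 GHz while the decayed sum of 665 requests about 1.62 GHz.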

------------

Energy impact can also be measured as suggested by Hongyan at [1].
On a Pixel6, the following workload is run 10 times:
- TaskA:
  - duty_cycle=20%, duration=0.4s
  - one task per mid and big CPU (2 mid and 2 big CPUs, so 4 in total)
  - Used to increase the runnable signals of TaskB
- TaskB:
  - sleep=2ms (to let TaskA run first)
  - first phase: duty_cycle=20%, duration=0.2s
  - second phase: duty_cycle=5%, duration=0.2s
  - 4 occurrences of TaskB

The duration of the workload is low (0.4s) to emphasize the impact of
continuing to run at an overestimated frequency.

Energy measured with energy counters:
┌────────────┬────────────┬───────────┬───────────┬───────────┐
│ base mean  ┆ patch mean ┆ base std  ┆ patch std ┆ ratio (%) │
╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
│ 536.412419 ┆ 487.232935 ┆ 68.591493 ┆ 66.862019 ┆     -9.16 │
└────────────┴────────────┴───────────┴───────────┴───────────┘

Energy computed from util signals and energy model:
┌────────────┬────────────┬───────────┬───────────┬───────────┐
│ base mean  ┆ patch mean ┆ base std  ┆ patch std ┆ ratio (%) │
╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
│  4.8318e9  ┆   4.0823e9 ┆ 5.1044e8  ┆  7.5558e8 ┆    -15.51 │
└────────────┴────────────┴───────────┴───────────┴───────────┘

------------

The initial patch [2] aimed to solve an issue detected while running
speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
versions are compared:
- base: the current version
- patch: the new version, with this patch applied
- revert: the initial version, with commit [2] reverted

Score (higher is better):
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean  ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│     108.16 ┆     104.06 ┆     105.82 ┆      -3.94% ┆       -2.16% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std  ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│      0.57 ┆      0.49 ┆       0.58 │
└───────────┴───────────┴────────────┘

Energy measured with energy counters:
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean  ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│  141262.79 ┆  130630.09 ┆  134108.07 ┆      -7.52% ┆       -5.64% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std  ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│   1347.13 ┆   2431.67 ┆     510.88 │
└───────────┴───────────┴────────────┘

Energy computed from util signals and energy model:
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean  ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│  2.0539e12 ┆  1.3569e12 ┆  1.3637e12 ┆     -33.93% ┆      -33.60% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std  ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│ 2.9206e10 ┆ 2.5434e10 ┆  1.7106e10 │
└───────────┴───────────┴────────────┘

OU ratio in % (ratio of time being overutilized over total time).
The test lasts ~65s:
┌────────────┬────────────┬─────────────┐
│ base mean  ┆ patch mean ┆ revert mean │
╞════════════╪════════════╪═════════════╡
│     63.39% ┆     12.48% ┆      12.28% │
└────────────┴────────────┴─────────────┘
┌───────────┬───────────┬─────────────┐
│ base std  ┆ patch std ┆ revert std  │
╞═══════════╪═══════════╪═════════════╡
│      0.97 ┆      0.28 ┆        0.88 │
└───────────┴───────────┴─────────────┘

The energy gain can be explained by the fact that the system is
overutilized during most of the test with the base version.

During the test, the base condition evaluates to true ~40%
of the time, while the new condition evaluates to true ~2% of
the time. Preventing util_est signals from decaying with the base
condition has a significant impact on the overutilized state,
because the resulting utilization of tasks is overestimated.
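
The link between overestimated utilization and the OU ratio can be
sketched as follows. This is an illustrative model only: the helper
names are hypothetical, and the ~20% capacity margin (util fits when
it stays below about 80% of the CPU capacity) is an assumption about
how the overutilized state is decided.

```c
#include <stdbool.h>

/* Assumed ~20% margin: util fits if it is below ~80% of the capacity. */
static bool fits_capacity_model(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

/* A CPU is treated as overutilized when its utilization does not fit. */
static bool cpu_overutilized_model(unsigned long util, unsigned long capacity)
{
	return !fits_capacity_model(util, capacity);
}

/* OU ratio in %: time spent overutilized over total time. */
static double ou_ratio_percent(double overutilized_s, double total_s)
{
	return total_s > 0.0 ? 100.0 * overutilized_s / total_s : 0.0;
}
```

With the ~65s test, an OU ratio of 63.39% corresponds to roughly 41s
spent overutilized, against roughly 8s for an OU ratio of 12.48%.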

The score is impacted by the patch, but:
- slightly lower scores are expected with EAS running more often
- higher scores are expected with the base version, since it makes
  the system run at higher frequencies by overestimating task
  utilization

------------

Decrease util_est when the rq has some idle time, instead of relying
on a delta between the util and runnable signals.

[1] https://lore.kernel.org/lkml/39cde23a-19d8-4e64-a1d2-f26bce264883@arm.com/
[2] commit 50181c0c ("sched/pelt: Avoid underestimation of task utilization")
    https://lore.kernel.org/lkml/20231122140119.472110-1-vincent.guittot@linaro.org/
[3] https://lore.kernel.org/lkml/CAKfTPtDd-HhF-YiNTtL9i5k0PfJbF819Yxu4YquzfXgwi7voyw@mail.gmail.com/#t



Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
parent cf016b81