Commit 2302b0f3 authored by Luis Machado

sched/events: Improve scheduler debugging/testing tracing interfaces

This patch is primarily a proposal aimed at improving the observability of the
scheduler, especially in the context of energy-aware scheduling, without
introducing long-term maintenance burdens such as a stable userspace ABI. I’m
seeking feedback from the community on whether this approach seems viable, or
if there are suggestions for making it more robust or maintainable.

Today, validating that a set of scheduler changes behaves sanely and doesn’t
regress performance or energy metrics can be time-consuming. On the
energy-aware side in particular, this often requires significant manual
intervention to collect, post-process, and analyze data.

Another challenge is the limited availability of platforms that can run a
mainline kernel while still exposing the detailed data we need. While we do
have some options, most devices running upstream kernels don’t provide as
much — or as precise — information as we’d like. The most data-rich devices
tend to be phones or Android-based systems, which typically run slightly
older or patched kernels, adding yet another layer of complexity.

As a result, going from reviewing a patch series on LKML to having a concrete
good/bad/neutral result often involves several intermediate steps and tooling
hurdles.

Our current data collection relies heavily on existing kernel tracepoints and
trace events. However, adding new trace events is increasingly discouraged,
since these are often treated as part of a de facto userspace ABI — something
we want to avoid maintaining long-term. So extending the trace events set isn’t
a viable option.

To work around this, we use a kernel module (LISA) that defines its own trace
events based on existing scheduler tracepoints. This approach gives us
flexibility in creating events without modifying the kernel’s core trace
infrastructure or establishing any new userspace ABI.

For the past few years, the scheduler has exposed tracepoint definitions in
include/trace/events/sched.h. A number of these are bare tracepoints that are
not exported via tracefs and are documented as hooks intended only for testing
and debugging purposes — which aligns well with our use case.
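For example, the PELT tracepoint tracking cfs_rq signals has been declared
roughly like this in recent kernels (the exact form has shifted slightly
across versions):

DECLARE_TRACE(pelt_cfs_tp,
    TP_PROTO(struct cfs_rq *cfs_rq),
    TP_ARGS(cfs_rq));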

However, this approach has limitations. One issue is the visibility of
tracepoint argument types. If a tracepoint uses a public type defined in a
public header, we can dereference members directly to extract data. But if
the type is internal or opaque — such as struct rq — we can’t access its
contents, which prevents us from retrieving useful values like the CPU number.

One workaround is to duplicate the kernel’s internal struct definitions in
the module, but this is fragile: it’s error-prone due to layout and alignment
mismatches, and it requires constantly tracking kernel changes to stay in sync.

A better approach, which we currently use, is to rely on BTF (BPF Type
Format) to reconstruct type information. BTF allows us to access internal
kernel types without having to maintain duplicate struct definitions. As long
as BTF info is available, we can introspect data structures even if they’re
not publicly defined.
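As an illustration only (not necessarily how LISA does it today), a module
with access to the vmlinux BTF could resolve the byte offset of a member
inside an internal struct along these lines. The helpers below are part of
the in-kernel BTF API, but whether all of them are reachable from a module is
an assumption, and bitfield/kind_flag handling is omitted for brevity:

#include <linux/btf.h>
#include <linux/err.h>
#include <linux/errno.h>
#include <linux/string.h>

/* Sketch: find the byte offset of @member_name inside struct @struct_name. */
static int btf_struct_member_offset(const char *struct_name,
                                    const char *member_name)
{
    const struct btf *btf = bpf_get_btf_vmlinux();
    const struct btf_type *t;
    const struct btf_member *m;
    s32 id;
    u16 i;

    if (IS_ERR_OR_NULL(btf))
        return -ENOENT;

    id = btf_find_by_name_kind(btf, struct_name, BTF_KIND_STRUCT);
    if (id < 0)
        return -ENOENT;

    t = btf_type_by_id(btf, id);
    m = btf_type_member(t);
    for (i = 0; i < btf_type_vlen(t); i++, m++) {
        const char *name = btf_name_by_offset(btf, m->name_off);

        /* m->offset is a bit offset; no kind_flag/bitfield handling here. */
        if (name && !strcmp(name, member_name))
            return m->offset / 8;
    }

    return -ENOENT;
}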

Using this, our module can define trace events and dereference internal types
to extract data — but it’s not without friction:

- Struct members are often nested deeply within BTF type trees, which can make
  it awkward to navigate and extract data.

- BTF describes data types, but not semantics. For example, sched_avg.util_est
  appears to be a plain numeric value, but in reality it encodes a flag
  alongside the actual utilization value. The kernel uses the following helper
  to extract the actual data:

static inline unsigned long _task_util_est(struct task_struct *p)
{
    return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
}

There is no way to infer from BTF alone that this masking is needed. And even
when such helpers exist in the kernel, they’re often inlined or unavailable
to modules, so we’d have to reimplement them — again reintroducing
maintenance overhead.

To address these challenges and reduce duplication, we propose adding an
extra argument to certain scheduler tracepoints: a pointer to a struct of
function pointers (callbacks). These callbacks would act as "getters" that
the module could use to fetch internal data in a safe, forward-compatible
way.

For example, to extract the CPU capacity from a struct rq (which is opaque to
the module), the module could call a getter function via the callback struct.
These functions would reside inside the kernel, and could leverage internal
knowledge, including inlined helpers and static data.

Here's an example of the proposed callback structure:

struct sched_tp_callbacks {
    /* Fetches the util_est from a cfs_rq. */
    unsigned int (*cfs_rq_util_est)(struct cfs_rq *cfs_rq);

    /* Fetches the util_est from a sched_entity. */
    unsigned int (*se_util_est)(struct sched_entity *se);

    /* Fetches the current CPU capacity from an rq. */
    unsigned long (*rq_cpu_current_capacity)(struct rq *rq);
};

The idea is simple: given a base type (e.g. rq, cfs_rq, sched_entity), the
module calls a getter function that returns the data it needs. These getters
encapsulate internal kernel logic and remove the need for the module to
replicate or guess how to access scheduler internals.
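As a rough sketch of the consumer side, a module probe could look like the
following. The callback names come from the structure above; the assumption
that the existing sched_util_est_cfs_tp tracepoint grows an extra
struct sched_tp_callbacks * argument, and the lisa_util_est_cfs trace event,
are illustrative only and not part of the PoC as posted:

#include <linux/module.h>
#include <trace/events/sched.h>

static void probe_util_est_cfs(void *data, struct cfs_rq *cfs_rq,
                               struct sched_tp_callbacks *cb)
{
    unsigned int util_est = cb->cfs_rq_util_est(cfs_rq);

    /* Re-emit the value through a module-defined trace event (no kernel ABI). */
    trace_lisa_util_est_cfs(util_est);
}

static int __init sched_tp_module_init(void)
{
    return register_trace_sched_util_est_cfs_tp(probe_util_est_cfs, NULL);
}

static void __exit sched_tp_module_exit(void)
{
    unregister_trace_sched_util_est_cfs_tp(probe_util_est_cfs, NULL);
}

module_init(sched_tp_module_init);
module_exit(sched_tp_module_exit);
MODULE_LICENSE("GPL");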

Since these additions would be part of tracepoints used for
testing/debugging, they are not considered stable ABI and can evolve as the
kernel changes. It would be up to the module to adapt to changes in available
hooks, types, or fields — something we already do today using BTF for
disappearing types (e.g. struct util_est becoming a raw integer).

While this approach would require some extra code in the kernel to define the
callback struct and register the getter functions, we believe it would
significantly improve the testability and maintainability of tooling like
LISA. It could also be extended to cover scheduler debugging scenarios beyond
energy-aware scheduling.
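For illustration, the kernel-side getters could look roughly like this. This
is a sketch rather than the PoC patch itself: the masking mirrors the inlined
util_est helpers, and the current-capacity calculation shown is only one
possible definition:

/* kernel/sched/ (sketch): getters with access to internal headers. */
static unsigned int sched_tp_cfs_rq_util_est(struct cfs_rq *cfs_rq)
{
    /* Mask out UTIL_AVG_UNCHANGED, as the inlined helpers do. */
    return READ_ONCE(cfs_rq->avg.util_est) & ~UTIL_AVG_UNCHANGED;
}

static unsigned int sched_tp_se_util_est(struct sched_entity *se)
{
    return READ_ONCE(se->avg.util_est) & ~UTIL_AVG_UNCHANGED;
}

static unsigned long sched_tp_rq_cpu_current_capacity(struct rq *rq)
{
    int cpu = cpu_of(rq);

    /* Max capacity scaled by the current frequency. */
    return arch_scale_cpu_capacity(cpu) *
           arch_scale_freq_capacity(cpu) >> SCHED_CAPACITY_SHIFT;
}

static struct sched_tp_callbacks sched_tp_callbacks = {
    .cfs_rq_util_est         = sched_tp_cfs_rq_util_est,
    .se_util_est             = sched_tp_se_util_est,
    .rq_cpu_current_capacity = sched_tp_rq_cpu_current_capacity,
};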

Our current testing pipeline already makes heavy use of LISA [1], which
automates test execution and data analysis. It also integrates with rt-app
[2] to generate configurable workloads.

The attached proof-of-concept patch adds three such callback functions as a
demonstration. We’ve tested this against a modified version of our module
that uses the callbacks to fetch scheduler internals.

We’d appreciate any feedback on whether this general direction makes sense
and how it might be refined.

[1] https://tooling.sites.arm.com/lisa/latest/
[2] https://github.com/scheduler-tools/rt-app



Signed-off-by: Luis Machado <luis.machado@arm.com>

---

Updates on v2:

- Fix build errors due to missing cfs_rq's avg field and lack of arch-specific
  cpu capacity and cpu frequency hooks.