sched/events: Improve scheduler debugging/testing tracing interfaces
This patch is primarily a proposal aimed at improving the observability of the scheduler, especially in the context of energy-aware scheduling, without introducing long-term maintenance burdens such as a stable userspace ABI. I'm seeking feedback from the community on whether this approach seems viable, or whether there are suggestions for making it more robust or maintainable.

Today, validating that a set of scheduler changes behaves sanely and doesn't regress performance or energy metrics can be time-consuming. On the energy-aware side in particular, this often requires significant manual intervention to collect, post-process, and analyze data.

Another challenge is the limited availability of platforms that can run a mainline kernel while still exposing the detailed data we need. While we do have some options, most devices running upstream kernels don't provide as much, or as precise, information as we'd like. The most data-rich devices tend to be phones or Android-based systems, which typically run slightly older or patched kernels, adding yet another layer of complexity. As a result, going from reviewing a patch series on LKML to having a concrete good/bad/neutral result often involves several intermediate steps and tooling hurdles.

Our current data collection relies heavily on existing kernel tracepoints and trace events. However, adding new trace events is increasingly discouraged, since they are often treated as part of a de facto userspace ABI, something we want to avoid maintaining long-term. So extending the set of trace events isn't a viable option.

To work around this, we use a kernel module (LISA) that defines its own trace events based on existing scheduler tracepoints. This approach gives us flexibility in creating events without modifying the kernel's core trace infrastructure or establishing any new userspace ABI. For the past few years, tracepoint definitions for the scheduler have been exposed in include/trace/events/sched.h. These definitions are not always made available via tracefs, and are documented as being for testing and debugging purposes, which aligns well with our use case.

However, this approach has limitations. One issue is the visibility of tracepoint argument types. If a tracepoint uses a public type defined in a public header, we can dereference members directly to extract data. But if the type is internal or opaque, such as struct rq, we can't access its contents, which prevents us from retrieving useful values like the CPU number.

One workaround is to duplicate the kernel's internal struct definitions in the module, but this is fragile: it is error-prone due to layout and alignment mismatches, and it requires constantly tracking kernel changes to stay in sync.

A better approach, which we currently use, is to rely on BTF (BPF Type Format) to reconstruct type information. BTF allows us to access internal kernel types without maintaining duplicate struct definitions. As long as BTF info is available, we can introspect data structures even if they're not publicly defined. Using this, our module can define trace events and dereference internal types to extract data, but it's not without friction:

- Struct members are often nested deeply within BTF type trees, which makes it awkward to navigate and extract data.
- BTF describes data types, but not semantics. For example, sched_avg.util_est appears to be a plain numeric value, but in reality it encodes a flag alongside the actual utilization value.
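To make the friction concrete, here is a minimal sketch of what a module-side probe on one of the existing bare tracepoints looks like today. This is illustrative only, not the actual LISA module code; it assumes the existing sched_util_est_se_tp tracepoint (exported for module use) and a recent kernel where sched_avg.util_est is a plain integer field:

  #include <linux/module.h>
  #include <linux/sched.h>
  #include <linux/tracepoint.h>
  #include <trace/events/sched.h>

  /*
   * Probe attached to the existing sched_util_est_se_tp bare tracepoint.
   * struct sched_entity is public, so the dereference below compiles, but
   * the value still carries the kernel-internal UTIL_AVG_UNCHANGED flag,
   * which the module would have to duplicate and mask out by hand. Fields
   * reachable only through opaque types (e.g. struct rq) need BTF-derived
   * offsets instead of a plain dereference.
   */
  static void probe_util_est_se(void *data, struct sched_entity *se)
  {
          trace_printk("raw util_est=%u\n", READ_ONCE(se->avg.util_est));
  }

  static int __init sched_tp_sketch_init(void)
  {
          return register_trace_sched_util_est_se_tp(probe_util_est_se, NULL);
  }

  static void __exit sched_tp_sketch_exit(void)
  {
          unregister_trace_sched_util_est_se_tp(probe_util_est_se, NULL);
          tracepoint_synchronize_unregister();
  }

  module_init(sched_tp_sketch_init);
  module_exit(sched_tp_sketch_exit);
  MODULE_LICENSE("GPL");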
The kernel uses the following helper to extract the actual data:

  static inline unsigned long _task_util_est(struct task_struct *p)
  {
          return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
  }

There is no way to infer from BTF alone that this masking is needed. And even when such helpers exist in the kernel, they're often inlined or unavailable to modules, so we'd have to reimplement them, again reintroducing maintenance overhead.

To address these challenges and reduce duplication, we propose adding an extra argument to certain scheduler tracepoints: a pointer to a struct of function pointers (callbacks). These callbacks would act as "getters" that the module could use to fetch internal data in a safe, forward-compatible way.

For example, to extract the CPU capacity from a struct rq (which is opaque to the module), the module could call a getter function via the callback struct. These functions would reside inside the kernel and could leverage internal knowledge, including inlined helpers and static data.

Here's an example of the proposed callback structure:

  struct sched_tp_callbacks {
          /* Fetches the util_est from a cfs_rq. */
          unsigned int (*cfs_rq_util_est)(struct cfs_rq *cfs_rq);
          /* Fetches the util_est from a sched_entity. */
          unsigned int (*se_util_est)(struct sched_entity *se);
          /* Fetches the current CPU capacity from an rq. */
          unsigned long (*rq_cpu_current_capacity)(struct rq *rq);
  };

The idea is simple: given a base type (e.g. rq, cfs_rq, sched_entity), the module calls a getter function that returns the data it needs. These getters encapsulate internal kernel logic and remove the need for the module to replicate or guess how to access scheduler internals.

Since these additions would be part of tracepoints used for testing/debugging, they are not considered stable ABI and can evolve as the kernel changes. It would be up to the module to adapt to changes in available hooks, types, or fields, something we already do today with BTF when internal types disappear or change (e.g. struct util_est becoming a raw integer).

While this approach would require some extra code in the kernel to define the callback struct and register the functions, we believe it would significantly improve the testability and maintainability of tooling like LISA. It could also be extended to support non-energy-aware scheduler debugging scenarios.

Our current testing pipeline already makes heavy use of LISA [1], which automates test execution and data analysis. It also integrates with rt-app [2] to generate configurable workloads.

The attached proof-of-concept patch adds three such callback functions as a demonstration. We've tested it against a modified version of our module that uses the callbacks to fetch scheduler internals. A rough, illustrative sketch of this kind of wiring is appended after the changelog below.

We'd appreciate any feedback on whether this general direction makes sense and how it might be refined.

[1] https://tooling.sites.arm.com/lisa/latest/
[2] https://github.com/scheduler-tools/rt-app

Signed-off-by: Luis Machado <luis.machado@arm.com>
---
Updates in v2:
- Fix build errors due to the missing cfs_rq avg field and the lack of arch-specific CPU capacity and CPU frequency hooks.
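For reference, below is a rough, illustrative sketch of the kind of wiring we have in mind. It is not the actual proof-of-concept patch: the getter bodies, the choice of sched_cpu_capacity_tp, and the modified tracepoint signature are assumptions made purely for illustration.

  /* Kernel side (e.g. kernel/sched/): getters can use internal helpers. */
  static unsigned int sched_tp_se_util_est(struct sched_entity *se)
  {
          /* The UTIL_AVG_UNCHANGED masking stays inside the kernel. */
          return READ_ONCE(se->avg.util_est) & ~UTIL_AVG_UNCHANGED;
  }

  static unsigned long sched_tp_rq_cpu_current_capacity(struct rq *rq)
  {
          int cpu = cpu_of(rq);

          /* Internal/arch helpers are visible here, unlike in the module. */
          return (arch_scale_cpu_capacity(cpu) *
                  arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
  }

  static struct sched_tp_callbacks sched_tp_callbacks = {
          .se_util_est             = sched_tp_se_util_est,
          .rq_cpu_current_capacity = sched_tp_rq_cpu_current_capacity,
  };

  /*
   * The tracepoint prototype gains the callbacks pointer as an extra
   * argument, and the emitting site passes it along, e.g.:
   *
   *     trace_sched_cpu_capacity_tp(rq, &sched_tp_callbacks);
   *
   * Module side: the probe calls the getter on the otherwise opaque rq.
   */
  static void probe_cpu_capacity(void *data, struct rq *rq,
                                 struct sched_tp_callbacks *cb)
  {
          trace_printk("cpu_current_capacity=%lu\n",
                       cb->rq_cpu_current_capacity(rq));
  }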