New cache callback recommendations - do they work?
I'm busy updating our reference code to the latest MLEK and core driver (for non-MLEK examples). I'm a bit concerned about the direction of the cache changes I see.
The latest docs suggest not implementing ethosu_flush_dcache:
It is recommended to not implement this function but have the user application make sure that IFM data has been written to memory before invoking an inference on the NPU.
That seems like a layering violation - a portable TFLu application has to become aware that it's running on a system with a hardware accelerator that needs the clean. It also cuts against my ethosu_config_select() path, which is trying to keep application code portable between TFLu and TFLu-with-Ethos.
But worse, I believe that omitting ethosu_flush_dcache will only work for a fully-accelerated model. If it's a mixed software/Ethos model, with write-back caching, you're going to need a clean of any output from software operators before going back into the NPU. A pre-inference application clean can't cover that.
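For reference, keeping the hook is cheap. A minimal ranged sketch, assuming CMSIS-Core cache ops on an M55/M85; the NULL-means-whole-cache convention is how I read the core driver's hook documentation:

```c
#include <stddef.h>
#include <stdint.h>
#include "ethosu_driver.h"   /* hook prototypes */
/* plus your device header pulling in core_cm55.h / core_cm85.h for the SCB cache ops */

void ethosu_flush_dcache(uint32_t *p, size_t bytes)
{
    if (SCB->CCR & SCB_CCR_DC_Msk) {            /* only if the D-cache is on */
        if (p == NULL) {
            SCB_CleanDCache();                  /* NULL = whole cache, per the driver docs */
        } else {
            SCB_CleanDCache_by_Addr(p, (int32_t)bytes);
        }
    }
}

void ethosu_invalidate_dcache(uint32_t *p, size_t bytes)
{
    if (SCB->CCR & SCB_CCR_DC_Msk) {
        if (p == NULL) {
            SCB_InvalidateDCache();
        } else {
            SCB_InvalidateDCache_by_Addr(p, (int32_t)bytes);
        }
    }
}
```

Because the driver calls these per region, the mixed-model case is handled without the application knowing anything about the accelerator.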
And there's an architectural correctness problem:
the invalidation call is done before waiting for the NPU to finish the inference so that depending on the network, the cycles for invalidating the cache may be completely hidden
Smart, but unfortunately not safe. As well as requiring that no software reads that area while the NPU is running, it's architecturally permitted for the CPU to speculatively load something from that area into the cache while the NPU is running. You technically need to invalidate the output address range after the NPU has finished.
In practice I'm sure we'll almost always get away with it, since on the M55 and M85 speculative cache fills only happen immediately after a sequential access pattern, and it's unlikely anyone is reading something immediately before (or in) the output area - but I don't like the reference implementation being something that isn't totally safe.
And on top of that, you also need to clean or invalidate the NPU output and scratch areas before the NPU starts, if there is any possibility of a dirty cache line getting written back over the area while the NPU is writing. The main danger there would be a software operator having previously written to an arena region that gets reused for the Ethos operator. In that case I don't actually know whether the existing ranged ethosu_flush_dcache calls for NPU inputs+scratch would always happen to cover the NPU output areas incidentally. Doubly so with the new bitmasks.
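Pulling those ordering requirements together, this is what I believe the architecturally safe sequence around a single NPU invocation has to look like. A sketch only - npu_start/npu_wait and the region pointers are placeholders, not driver API:

```c
#include <stddef.h>
#include <stdint.h>
/* device header providing the CMSIS-Core cache ops assumed */

extern void npu_start(void);   /* placeholders for the driver's start/wait */
extern void npu_wait(void);
extern uint32_t *ifm, *scratch, *ofm;
extern size_t ifm_bytes, scratch_bytes, ofm_bytes;

void invoke_architecturally_safely(void)
{
    /* Before start: make CPU-written IFM/scratch visible to the NPU. */
    SCB_CleanDCache_by_Addr(ifm, (int32_t)ifm_bytes);
    SCB_CleanDCache_by_Addr(scratch, (int32_t)scratch_bytes);

    /* Before start: no dirty line over the output may remain, or an eviction
     * could land on top of NPU writes mid-inference (e.g. a line dirtied by a
     * software operator that previously used this part of the arena). */
    SCB_CleanInvalidateDCache_by_Addr(ofm, (int32_t)ofm_bytes);

    npu_start();
    npu_wait();

    /* After completion, not before: drop anything the CPU may have
     * speculatively allocated over the output while the NPU was running. */
    SCB_InvalidateDCache_by_Addr(ofm, (int32_t)ofm_bytes);
}
```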
The ML evaluation kit currently retains both hooks, which keeps mixed models safe on the clean side, and has its own optimisation of doing only a single global cache op before and after each NPU invocation (hence not really caring about the region bitmask, and avoiding any worries about dirty output areas). A global op tends to be more efficient anyway - my tests suggested that with a 32KB cache, it's not worth doing a ranged operation bigger than about 100KB.
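To illustrate that crossover (the ~100KB figure is just my measurement on a 32KB cache, encoded as a constant here):

```c
#include <stddef.h>
#include <stdint.h>
/* device header providing the CMSIS-Core cache ops assumed */

/* Rough crossover from my tests with a 32KB D-cache: beyond ~100KB a ranged
 * clean walks more lines than just cleaning the whole cache. */
#define RANGED_OP_LIMIT_BYTES (100u * 1024u)

static void clean_region(uint32_t *p, size_t bytes)
{
    if (bytes > RANGED_OP_LIMIT_BYTES) {
        SCB_CleanDCache();    /* global op is cheaper for large regions */
    } else {
        SCB_CleanDCache_by_Addr(p, (int32_t)bytes);
    }
}
```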
And on our local branch we've extended that with address-based checks for whether an area needs cleaning and/or invalidating (optimising for TCMs, write-through areas and non-volatile memory). That has allowed us to avoid the cleans entirely by making sure no Ethos-accessed areas are write-back - performance-wise, write-back vs write-through really doesn't seem to matter much. And with everything address-based and routed through driver hooks, the application/use-case code doesn't have to know the details, as previously mentioned.
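Roughly how the address-based classification looks - all the region boundaries below are made-up placeholders; the real ones come from the MPU/linker configuration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory map - substitute the real MPU/linker regions. */
#define DTCM_BASE   0x20000000u   /* TCM: never cached */
#define DTCM_LIMIT  0x20080000u
#define FLASH_BASE  0x10000000u   /* non-volatile, effectively read-only */
#define FLASH_LIMIT 0x12000000u

/* TCMs are uncached, and write-through / read-only regions can never hold
 * dirty lines, so none of them ever needs a clean. */
static bool needs_clean(uintptr_t a)
{
    if (a >= DTCM_BASE && a < DTCM_LIMIT)   return false;
    if (a >= FLASH_BASE && a < FLASH_LIMIT) return false;
    /* ...write-through regions would be filtered out here too... */
    return true;   /* assume write-back cacheable otherwise */
}

/* Stale lines are still possible over anything cacheable that the NPU wrote,
 * so only genuinely uncached regions (the TCMs here) skip the invalidate. */
static bool needs_invalidate(uintptr_t a)
{
    return !(a >= DTCM_BASE && a < DTCM_LIMIT);
}
```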
That MLEK approach combined with write-through caching seems like the best compromise: reasonable performance while being architecturally correct (provided the invalidate hook is called after NPU completion) - but it's rather fighting the current core driver hook design.
So much so that I think the cache call reordering may have just broken the assumptions of its state machine?
If I were to redesign it, I'd have the core driver make "bulk" cache callbacks before and after each NPU invocation, passing the full region list, so that smart hooks can see the whole picture in one go before choosing what to do - rather than having to build a state machine that reverse-engineers a per-invocation view from the per-region calls, as the MLEK does (rough sketch below).
And the default/example callbacks could just ignore the region info and do a global Clean before and CleanInvalidate after, like the MLEK default. (Conditional on the cache being enabled, and you could make the Clean conditional on the M55/M85's sticky MSCR.DCCLEAN flag - any system that has never used write-back gets it skipped automatically.) That default would work really quite well - much better than the previous basic 5-6 ranged ops.
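To make the proposal concrete, something like the following - none of this is existing driver API; the struct and function names are invented:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
/* device header assumed; MEMSYSCTL/MSCR come from CMSIS core_cm55.h / core_cm85.h */

/* Proposed, NOT existing, driver API: one bulk callback each side of an NPU
 * invocation, given the full region list for that invocation. */
struct ethosu_cache_region {
    void  *base;
    size_t bytes;
    bool   npu_writes;   /* OFM/scratch, as opposed to read-only IFM/weights */
};

void ethosu_cache_preinvoke(const struct ethosu_cache_region *regions, size_t count)
{
    (void)regions;
    (void)count;         /* the default ignores the region info entirely */

    if (!(SCB->CCR & SCB_CCR_DC_Msk)) {
        return;          /* D-cache disabled: nothing to do */
    }
    /* Skip the global clean if the sticky MSCR.DCCLEAN bit says the cache
     * holds no dirty lines (assuming 1 = clean; verify the polarity and the
     * mask name against the TRM and your CMSIS version). */
    if ((MEMSYSCTL->MSCR & MEMSYSCTL_MSCR_DCCLEAN_Msk) == 0u) {
        SCB_CleanDCache();
    }
}

void ethosu_cache_postinvoke(const struct ethosu_cache_region *regions, size_t count)
{
    (void)regions;
    (void)count;

    if (SCB->CCR & SCB_CCR_DC_Msk) {
        SCB_CleanInvalidateDCache();   /* CleanInvalidate, not plain Invalidate */
    }
}
```

The (void) casts are the point: the default ignores the region list entirely, while a smarter hook gets the whole per-invocation picture handed to it for free.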
(Though upstream MLEK currently, and incorrectly, does a global Invalidate afterwards - I now recall patching that locally to CleanInvalidate but never upstreaming it.)
These per-invocation cache calls could actually be added alongside the existing hooks and exist in parallel with them - users would just need to not implement both.
The per-invocation calls could also include a "post-start, pre-wait" hook to implement the invalidate-during-inference concept, for anyone who wants to live dangerously - or who has managed to make it safe, e.g. with extra messing about with MPU cacheability settings.
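Schematically, continuing the invented names from the sketch above:

```c
#include <stddef.h>

struct ethosu_cache_region;   /* as in the sketch above (invented) */
extern void ethosu_cache_preinvoke(const struct ethosu_cache_region *, size_t);
extern void ethosu_cache_midinvoke(const struct ethosu_cache_region *, size_t);
extern void ethosu_cache_postinvoke(const struct ethosu_cache_region *, size_t);
extern void npu_start(void);  /* placeholder start/wait, not driver API */
extern void npu_wait(void);

static void invoke_with_overlap(const struct ethosu_cache_region *r, size_t n)
{
    ethosu_cache_preinvoke(r, n);    /* clean, as before */
    npu_start();
    /* Optional overlap hook: invalidate the outputs here and the cycles hide
     * behind the inference - but only safe if speculative fills over the
     * output areas have been ruled out (e.g. MPU cacheability tricks). A user
     * implementing this would make the post hook a no-op, and vice versa. */
    ethosu_cache_midinvoke(r, n);
    npu_wait();
    ethosu_cache_postinvoke(r, n);   /* the safe default: invalidate after completion */
}
```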