mm/mmu_gather: Store and process pages in contig ranges
mmu_gather accumulates a set of pages into a buffer for later rmap
removal and freeing. Page pointers were previously stored in a "linked
list of arrays", then at flush time, each page in the buffer was removed
from the rmap, removed from the swapcache and its refcount was
decremented; if the refcount reached 0, then it was freed.
With increasing numbers of large folios (or at least contiguous parts of
large folios) mapped into userspace processes (pagecache pages for
supporting filesystems currently, but in future also large anonymous
folios), we can measurably improve performance of process teardown:
- For rmap removal, we can batch-remove a range of pages belonging to
the same folio with folio_remove_rmap_range(), which is more efficient
  because atomics can be manipulated just once per range. In the common
  case, it also allows us to elide adding the (anon) folio to the
  deferred split queue, only to remove it a bit later, once all pages of
  the folio have been removed from the rmap.
- For swapcache removal, we only need to check and remove the folio from
the swap cache once, rather than trying for each individual page.
- For page release, we can batch-decrement the refcount for each page in
the folio and free it if it hits zero.
Change the page pointer storage format within the mmu_gather batch
structure to store "pfn_range"s: a [start, end) pfn pair. This allows us
to run-length encode a contiguous range of pages that all belong to the
same folio. This likely improves cache locality a bit, but it also gives
us a convenient format for implementing the above 3 optimizations.
Of course, on a system that does not make extensive use of large
pte-mapped folios, the RLE approach uses twice as much memory, because
each range covers a single page and takes 2 words. However, performance
measurements show no regression in that case.
Macro Performance Results
-------------------------
Test: Timed kernel compilation on Ampere Altra (arm64), 80 jobs
Configs: Comparing with and without large anon folios
Without large anon folios:
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-off | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.3% | -0.3% | 0.1% |
With large anon folios (order-3):
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-on | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.7% | -3.9% | -0.1% |
Test: Timed kernel compilation in VM on Apple M2 MacBook Pro, 8 jobs
Configs: Comparing with and without large anon folios
Without large anon folios:
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-off | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.9% | -2.9% | -0.6% |
With large anon folios (order-3):
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-on | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.4% | -3.7% | -0.2% |
Micro Performance Results
-------------------------
Flame graphs for kernel compilation on Ampere Altra show reduction in
cycles consumed by __arm64_sys_exit_group syscall:
Without large anon folios: -2%
With large anon folios: -26%
For the large anon folios case, it also shows a big difference in cost
of rmap removal:
baseline: cycles in page_remove_rmap(): 24.7B
mmugather-range: cycles in folio_remove_rmap_range(): 5.5B
Furthermore, the baseline shows 5.2B cycles used by
deferred_split_folio() which has completely disappeared after
applying this series.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>