mm/mmu_gather: Gather and process pages in contig folio regions (c406ed62) · Commits · linux-arm / linux-rr

Commit c406ed62 authored Sep 01, 2023 by Ryan Roberts
mm/mmu_gather: Gather and process pages in contig folio regions



mmu_gather accumulates a set of pages into a buffer for later rmap
removal and freeing. Page pointers were previously stored in a "linked
list of arrays", then at flush time, each page in the buffer was removed
from the rmap, removed from the swapcache and its refcount was
decremented; if the refcount reached 0, then it was freed.

With increasing numbers of large folios (or at least contiguous parts of
large folios) mapped into userspace processes (pagecache pages for
supporting filesystems currently, but in future also large anonymous
folios), we can measurably improve performance of process teardown:

- For rmap removal, we can batch-remove a range of pages belonging to
  the same folio with folio_remove_rmap_range(), which is more efficient
  because atomics can be manipulated just once per range. In the common
  case, it also allows us to elide adding the (anon) folio to the
  deferred split queue, only to remove it a bit later, once all pages of
  the folio have been removed from the rmap.

- For swapcache removal, we only need to check and remove the folio from
  the swap cache once, rather than trying for each individual page.

- For page release, we can batch-decrement the refcount for each page in
  the folio and free it if it hits zero.

Change the page pointer storage format within the mmu_gather batch
structure to store "folio_regions"s; a [start, end) pfn pair. This
allows us to run length encode a contiguous region of pages that all
belong to the same folio. This likely allows us to improve cache
locality a bit. But it also gives us a convenient format for
implementing the above 3 optimizations.

Of course if running on a system that does not extensively use large
pte-mapped folios, then the RLE approach uses twice as much memory,
because each region is 1 page long and uses 2 words. But performance
measurements show no impact in terms of performance.

mmu_gather was the only user of free_pages_and_swap_cache() so this
function has been removed and replaced with
free_folio_regions_and_swap_cache(), which does everything we need
directly. While it currently duplicates similar logic to
release_pages(), an upcoming series from Matthew Wilcox will both remove
this duplication and improve performance of the lower level folio
freeing. free_folio_regions_and_swap_cache() is the common interface for
that.

Macro Performance Results
-------------------------

Test: Timed kernel compilation on Ampere Altra (arm64), 80 jobs
Configs: Comparing with and without large anon folios

Without large anon folios:
| kernel           |   real-time |   kern-time |   user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-off |        0.0% |        0.0% |        0.0% |
| mmugather-region |       -0.3% |       -0.3% |        0.1% |

With large anon folios (order-3):
| kernel           |   real-time |   kern-time |   user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-on  |        0.0% |        0.0% |        0.0% |
| mmugather-region |       -0.7% |       -3.9% |       -0.1% |

Test: Timed kernel compilation in VM on Apple M2 MacBook Pro, 8 jobs
Configs: Comparing with and without large anon folios

Without large anon folios:
| kernel           |   real-time |   kern-time |   user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-off |        0.0% |        0.0% |        0.0% |
| mmugather-region |       -0.9% |       -2.9% |       -0.6% |

With large anon folios (order-3):
| kernel           |   real-time |   kern-time |   user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-on  |        0.0% |        0.0% |        0.0% |
| mmugather-region |       -0.4% |       -3.7% |       -0.2% |

Micro Performance Results
-------------------------

Flame graphs for kernel compilation on Ampere Altra show reduction in
cycles consumed by __arm64_sys_exit_group syscall:

    Without large anon folios: -2%
    With large anon folios:    -26%

For the large anon folios case, it also shows a big difference in cost
of rmap removal:

   baseline: cycles in page_remove_rmap(): 24.7B
   mmugather-region: cycles in folio_remove_rmap_range(): 5.5B

Furthermore, the baseline shows 5.2B cycles used by
deferred_split_folio() which has completely disappeared after
applying this series.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
parent 6c143c81
Hide whitespace changes
Inline Side-by-side
Please register or to comment