mm/mmu_gather: Store and process pages in contig ranges
mmu_gather accumulates a set of pages into a buffer for later rmap
removal and freeing. Page pointers were previously stored in a "linked
list of arrays", then at flush time, each page in the buffer was removed
from the rmap, removed from the swapcache and its refcount was
decremented; if the refcount reached 0, then it was freed.
With increasing numbers of large folios (or at least contiguous parts of
large folios) mapped into userspace processes (pagecache pages for
supporting filesystems currently, but in future also large anonymous
folios), we can measurably improve performance of process teardown:
- For rmap removal, we can batch-remove a range of pages belonging to
the same folio with folio_remove_rmap_range(), which is more efficient
  because atomics can be manipulated just once per range. In the common
  case, it also allows us to elide adding the (anon) folio to the
  deferred split queue, only to remove it a bit later, once all pages of
  the folio have been removed from the rmap.
- For swapcache removal, we only need to check and remove the folio from
the swap cache once, rather than trying for each individual page.
- For page release, we can batch-decrement the refcount for each page in
the folio and free it if it hits zero.
Change the page pointer storage format within the mmu_gather batch
structure to store "pfn_range"s: a [start, end) pfn pair. This allows us
to run-length encode a contiguous range of pages that all belong to the
same folio. This likely improves cache locality a bit, but it also gives
us a convenient format for implementing the above 3 optimizations.
Of course, on a system that does not make extensive use of large
pte-mapped folios, the RLE approach uses twice as much memory, because
each range covers a single page and takes 2 words. However, performance
measurements show no regression in that case.
Macro Performance Results
-------------------------
Test: Timed kernel compilation on Ampere Altra (arm64), 80 jobs
Configs: Comparing with and without large anon folios
Without large anon folios:
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-off | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.3% | -0.3% | 0.1% |
With large anon folios (order-3):
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-on | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.7% | -3.9% | -0.1% |
Test: Timed kernel compilation in VM on Apple M2 MacBook Pro, 8 jobs
Configs: Comparing with and without large anon folios
Without large anon folios:
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-off | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.9% | -2.9% | -0.6% |
With large anon folios (order-3):
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-on | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.4% | -3.7% | -0.2% |
Micro Performance Results
-------------------------
Flame graphs for kernel compilation on Ampere Altra show reduction in
cycles consumed by __arm64_sys_exit_group syscall:
Without large anon folios: -2%
With large anon folios: -26%
For the large anon folios case, it also shows a big difference in cost
of rmap removal:
baseline: cycles in page_remove_rmap(): 24.7B
mmugather-range: cycles in folio_remove_rmap_range(): 5.5B
Furthermore, the baseline shows 5.2B cycles used by
deferred_split_folio() which has completely disappeared after
applying this series.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>