mm: swap: Swap-out small-sized THP without splitting
The upcoming anonymous small-sized THP feature enables performance
improvements by allocating large folios for anonymous memory. However
I've observed that on an arm64 system running a parallel workload (e.g.
kernel compilation) across many cores, under high memory pressure, the
speed regresses. This is due to bottlenecking on the increased number of
TLBIs added due to all the extra folio splitting.
Therefore, solve this regression by adding support for swapping out
small-sized THP without needing to split the folio, just like is already
done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
enabled, and when the swap backing store is a non-rotating block device.
These are the same constraints as for the existing PMD-sized THP
swap-out support.
Note that no attempt is made to swap-in THP here - this is still done
page-by-page, like for PMD-sized THP.
The main change here is to improve the swap entry allocator so that it
can allocate any power-of-2 number of contiguous entries between [1, (1
<< PMD_ORDER)]. This is done by allocating a cluster for each distinct
order and allocating sequentially from it until the cluster is full.
This ensures that we don't need to search the map and we get no
fragmentation due to alignment padding for different orders in the
cluster. If there is no current cluster for a given order, we attempt to
allocate a free cluster from the list. If there are no free clusters, we
fail the allocation and the caller falls back to splitting the folio and
allocates individual entries (as per existing PMD-sized THP fallback).
The per-order current clusters are maintained per-cpu using the existing
infrastructure. This is done to avoid interleving pages from different
tasks, which would prevent IO being batched. This is already done for
the order-0 allocations so we follow the same pattern.
__scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
for order-0.
As is done for order-0 per-cpu clusters, the scanner now can steal
order-0 entries from any per-cpu-per-order reserved cluster. This
ensures that when the swap file is getting full, space doesn't get tied
up in the per-cpu reserves.
I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
device as the swap device and from inside a memcg limited to 40G memory.
I've then run `usemem` from vm-scalability with 70 processes (each has
its own core), each allocating and writing 1G of memory. I've repeated
everything 5 times and taken the mean:
Mean Performance Improvement vs 4K/baseline
| alloc size | baseline | + this series |
| | v6.6-rc4+anonfolio | |
|:-----------|--------------------:|--------------------:|
| 4K Page | 0.0% | 4.9% |
| 64K THP | -44.1% | 10.7% |
| 2M THP | 56.0% | 65.9% |
So with this change, the regression for 64K swap performance goes away
and 4K and 2M swap improves slightly too.
Signed-off-by:
Ryan Roberts <ryan.roberts@arm.com>
Loading
Please register or sign in to comment