Commit 92b4c605 authored Dec 11, 2023 by Ryan Roberts

arm64/mm: Implement new helpers to optimize fork()

With the core-mm changes in place to batch-copy ptes during fork, we can
take advantage of this in arm64 to greatly reduce the number of tlbis we
have to issue, and recover the lost fork performance incurred when adding
support for transparent contiguous ptes.

This optimization covers 2 cases:

1) The memory being CoWed is contpte-sized (or bigger) folios. We set
   wrprotect in the parent and set the ptes in the child for a whole
   contpte block in one hit. This means we can operate on the whole
   block and don't need to unfold/fold.

2) The memory being CoWed is all order-0 folios. No folding or unfolding
   occurs here, but the added cost of checking if we need to fold on
   every pte adds up. Given we are forking, we are just copying the ptes
   already in the parent, so we should be maintaining the single/contpte
   state into the child anyway, and any check for folding will always be
   false. Therefore, we can elide the fold check in set_ptes_full() and
   ptep_set_wrprotects() when full=1.
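
The contpte-sized case can be illustrated with a small user-space model.
This is only a sketch, not the kernel implementation: the bit positions,
the 16-entry block size (as for 4K pages), the stats counter and all
function names below are assumptions made for illustration. It contrasts
write-protecting one pte at a time, which has to unfold each contiguous
block first, with a batched pass that handles a whole aligned block in
one go and leaves the contiguous bit alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for this model only; not the arm64 definitions. */
#define CONT_PTES 16u          /* assumed entries per contpte block */
#define PTE_WRITE (1ull << 0)  /* made-up "writable" bit */
#define PTE_CONT  (1ull << 1)  /* made-up "contiguous" bit */

typedef uint64_t pte_t;

/* Counts how many blocks had to be unfolded (the expensive step). */
struct stats { unsigned unfolds; };

/* Clear the contiguous bit on every entry of one block. */
static void unfold_block(pte_t *block, struct stats *st)
{
    for (unsigned i = 0; i < CONT_PTES; i++)
        block[i] &= ~PTE_CONT;
    st->unfolds++;
}

/* Per-pte path: unfold the containing block if needed, then wrprotect. */
static void wrprotect_one(pte_t *table, size_t idx, struct stats *st)
{
    assert(table != NULL && st != NULL);
    if (table[idx] & PTE_CONT)
        unfold_block(&table[idx - idx % CONT_PTES], st);
    table[idx] &= ~PTE_WRITE;
}

/*
 * Batched path: when the range covers a whole, aligned contpte block,
 * clear the write bit on all of its entries and keep the block folded.
 * Anything else falls back to the per-pte path.
 */
static void wrprotect_batch(pte_t *table, size_t start, size_t nr,
                            struct stats *st)
{
    size_t i = start, end = start + nr;

    assert(table != NULL && st != NULL);
    while (i < end) {
        int whole = (i % CONT_PTES == 0) &&
                    (end - i >= CONT_PTES) &&
                    (table[i] & PTE_CONT);
        if (whole) {
            for (size_t j = 0; j < CONT_PTES; j++)
                table[i + j] &= ~PTE_WRITE;
            i += CONT_PTES;
        } else {
            wrprotect_one(table, i, st);
            i++;
        }
    }
}
```

In this model, write-protecting two fully contiguous blocks through
wrprotect_batch() performs no unfolds, while looping over the same
entries with wrprotect_one() unfolds each block once.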

The optimization to wrprotect a whole contpte block without unfolding is
possible thanks to the tightening of the Arm ARM with respect to the
definition and behaviour when 'Misprogramming the Contiguous bit'. See
section D21194 at https://developer.arm.com/documentation/102105/latest/



The following microbenchmark results demonstrate the recovered (and
overall improved) fork performance for large pte-mapped folios once this
patch is applied. Fork is called in a tight loop in a process with 1G of
populated memory and the time for the function to execute is measured.
100 iterations per run, 8 runs performed on both Apple M2 (VM) and
Ampere Altra (bare metal). Tests were performed for the case where the
1G of memory is composed of pte-mapped order-9 folios. Negative is
faster, positive is slower, compared to the baseline upon which the
series is based:

| fork          |    Apple M2 VM    |    Ampere Altra   |
| order-9       |-------------------|-------------------|
| (pte-map)     |    mean |   stdev |    mean |   stdev |
|---------------|---------|---------|---------|---------|
| baseline      |    0.0% |    1.2% |    0.0% |    0.1% |
| before-change |  541.5% |    2.8% | 3654.4% |    0.0% |
| after-change  |  -25.4% |    1.9% |   -6.7% |    0.1% |
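
As a reading aid for the table, the percentages can be interpreted as
the relative change in mean fork time against the baseline. The message
does not spell out the exact formula, so the helpers below are a sketch
under that assumption, and both function names are hypothetical.

```c
#include <assert.h>

/*
 * Relative change of a measured time against a baseline time, in
 * percent. Negative means faster than baseline, positive means slower.
 * Assumed formula; not taken from the benchmark source.
 */
static double relative_change_pct(double baseline, double measured)
{
    assert(baseline > 0.0);
    return (measured - baseline) / baseline * 100.0;
}

/* Convert a relative change in percent into a multiple of baseline. */
static double slowdown_factor(double pct)
{
    return 1.0 + pct / 100.0;
}
```

Under this reading, the 3654.4% before-change figure on Ampere Altra
corresponds to roughly 37.5 times the baseline fork time, and the -6.7%
after-change figure to roughly 0.93 times the baseline.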

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

parent dc3ce25a
