arm64/mm: Implement new helpers to optimize fork()
With the core-mm changes in place to batch-copy ptes during fork, we can take advantage of this in arm64 to greatly reduce the number of tlbis we have to issue, and recover the fork performance that was lost when support for transparent contiguous ptes was added.

This optimization covers 2 cases:

1) The memory being CoWed is all order-0 folios. No folding or unfolding occurs here, but the added cost of checking if we need to fold on every pte adds up. Given we are forking, we are just copying the ptes already in the parent, so we should be maintaining the single/contpte state into the child anyway, and any check for folding will always be false. Therefore, we can elide the fold check in set_ptes_full() and ptep_set_wrprotects() when full=1.

2) The memory being CoWed is contpte-sized (or bigger) folios. We set wrprotect in the parent and set the ptes in the child for a whole contpte block in one hit. This means we can operate on the whole block and don't need to unfold/fold.

The optimization to wrprotect a whole contpte block without unfolding is possible thanks to the tightening of the Arm ARM with respect to the definition and behaviour when 'Misprogramming the Contiguous bit'. See section D21194 at https://developer.arm.com/documentation/102105/latest/

The following microbenchmark results demonstrate the recovered (and overall improved) fork performance for large pte-mapped folios once this patch is applied. Fork is called in a tight loop in a process with 1G of populated memory and the time for the function to execute is measured. 100 iterations per run, 8 runs performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests performed for the case where the 1G of memory is composed of pte-mapped order-9 folios.
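For reference, a minimal userspace sketch of a fork microbenchmark of the kind described above. This is not the harness used to produce the numbers in this commit: time_fork_loop() and its parameters are hypothetical names, and the sketch does not control which folio order backs the mapping (that depends on the THP configuration of the system under test).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of this patch): populate 'bytes' of
 * anonymous private memory, then call fork() 'iterations' times and
 * return the mean wall-clock time in seconds that fork() took in the
 * parent. Returns -1.0 on invalid arguments or on any failure.
 */
double time_fork_loop(size_t bytes, int iterations)
{
	struct timespec t0, t1;
	double total = 0.0;
	char *mem;
	int i;

	if (bytes == 0 || iterations <= 0)
		return -1.0;

	mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1.0;

	/* Write to the whole range so it is populated before forking. */
	memset(mem, 1, bytes);

	for (i = 0; i < iterations; i++) {
		pid_t pid;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		pid = fork();
		if (pid == 0)
			_exit(0);	/* child: nothing to do */
		clock_gettime(CLOCK_MONOTONIC, &t1);

		if (pid < 0) {
			munmap(mem, bytes);
			return -1.0;
		}

		/* Reap the child so processes do not accumulate. */
		waitpid(pid, NULL, 0);

		total += (double)(t1.tv_sec - t0.tv_sec) +
			 (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	}

	munmap(mem, bytes);
	return total / iterations;
}
```

Under these assumptions, one run matching the setup above would call time_fork_loop((size_t)1 << 30, 100), with the result compared across kernels as a percentage change relative to the baseline.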
Negative is faster, positive is slower, compared to baseline upon which the series is based:

| fork          |    Apple M2 VM    |   Ampere Altra    |
| order-9       |-------------------|-------------------|
| (pte-map)     |    mean |   stdev |    mean |   stdev |
|---------------|---------|---------|---------|---------|
| baseline      |    0.0% |    1.2% |    0.0% |    0.1% |
| before-change |  541.5% |    2.8% | 3654.4% |    0.0% |
| after-change  |  -25.4% |    1.9% |   -6.7% |    0.1% |

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>