arm64/mm: Implement ptep_set_wrprotects() to optimize fork()

With the core-mm changes in place to batch-copy ptes during fork, we can
take advantage of this in arm64 to greatly reduce the number of tlbis we
have to issue, and recover the fork performance lost when adding support
for transparent contiguous ptes.

If we are write-protecting a whole contig range, we can apply the
write-protection to the whole range and know that it won't change
whether the range should have the contiguous bit set or not. For ranges
smaller than the contig range, we will still have to unfold, apply the
write-protection, then fold if the change now means the range is
foldable.

This optimization is possible thanks to the tightening of the Arm ARM
with respect to the definition and behaviour when 'Misprogramming the
Contiguous bit'. See section D21194 at
https://developer.arm.com/documentation/102105/latest/

Performance tested with the following test written for the will-it-scale
framework:

-------

char *testcase_description = "fork and exit";

void testcase(unsigned long long *iterations, unsigned long nr)
{
	int pid;
	char *mem;

	mem = malloc(SZ_128M);
	assert(mem);
	memset(mem, 1, SZ_128M);

	while (1) {
		pid = fork();
		assert(pid >= 0);

		if (!pid)
			exit(0);

		waitpid(pid, NULL, 0);

		(*iterations)++;
	}
}

-------

I see a huge performance regression when PTE_CONT support is added; the
regression is mostly fixed by this change. The following shows the
regression relative to before PTE_CONT was enabled (a more negative
value is a bigger regression):

|   cpus |   before opt |   after opt |
|-------:|-------------:|------------:|
|      1 |       -10.4% |       -5.2% |
|      8 |       -15.4% |       -3.5% |
|     16 |       -38.7% |       -3.7% |
|     24 |       -57.0% |       -4.4% |
|     32 |       -65.8% |       -5.4% |

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
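
The whole-range versus partial-range handling described above can be
illustrated with a small userspace model. This is only a sketch under
stated assumptions, not the kernel implementation: every `model_*` and
`MODEL_*` name is hypothetical, the block size of 16 entries is an
assumed value, ptes are modelled as bare flag words with no pfn, and an
unfold counter stands in for the extra TLB invalidation an unfold would
cost. The pte array is assumed to cover whole contig blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model values, not the kernel's real definitions. */
#define MODEL_CONT_PTES  16u          /* entries per contiguous block */
#define MODEL_PTE_WRITE  (1ull << 0)  /* entry is writable */
#define MODEL_PTE_CONT   (1ull << 1)  /* entry carries the contiguous hint */

/* Counts unfolds; each one stands in for extra TLB invalidation. */
static unsigned long model_unfold_count;

/* Clear the contiguous hint on every entry of the block containing idx. */
static void model_unfold(uint64_t *ptes, size_t idx)
{
	size_t base = idx - (idx % MODEL_CONT_PTES);

	if (!(ptes[base] & MODEL_PTE_CONT))
		return;
	for (size_t i = 0; i < MODEL_CONT_PTES; i++)
		ptes[base + i] &= ~MODEL_PTE_CONT;
	model_unfold_count++;
}

/* Set the hint again if every entry in the block now has equal flags. */
static void model_try_fold(uint64_t *ptes, size_t idx)
{
	size_t base = idx - (idx % MODEL_CONT_PTES);

	for (size_t i = 1; i < MODEL_CONT_PTES; i++)
		if (ptes[base + i] != ptes[base])
			return;
	for (size_t i = 0; i < MODEL_CONT_PTES; i++)
		ptes[base + i] |= MODEL_PTE_CONT;
}

/* Write-protect nr entries starting at index start. */
static void model_wrprotect_range(uint64_t *ptes, size_t start, size_t nr)
{
	size_t end = start + nr;

	assert(nr > 0);
	while (start < end) {
		size_t base = start - (start % MODEL_CONT_PTES);
		size_t block_end = base + MODEL_CONT_PTES;
		size_t stop = end < block_end ? end : block_end;
		bool whole = (start == base) && (stop == block_end);

		/* Partial block: the hint may no longer hold, so unfold. */
		if (!whole)
			model_unfold(ptes, start);
		for (size_t i = start; i < stop; i++)
			ptes[i] &= ~MODEL_PTE_WRITE;
		/* Partial block: refold if the entries agree again. */
		if (!whole)
			model_try_fold(ptes, start);
		start = stop;
	}
}
```

In this model, write-protecting a full aligned block leaves
`model_unfold_count` unchanged and keeps the hint set, whereas
write-protecting part of a block costs one unfold and only refolds once
the rest of the block has been write-protected too.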