mm/filemap: Allow arch to request folio size for exec memory
Change the readahead config so that if it is being requested for an executable mapping, do a synchronous read into a set of folios with an arch-specified order and in a naturally aligned manner. We no longer center the read on the faulting page but simply align it down to the previous natural boundary. Additionally, we don't bother with an asynchronous part.

On arm64 if memory is physically contiguous and naturally aligned to the "contpte" size, we can use contpte mappings, which improves utilization of the TLB. When paired with the "multi-size THP" feature, this works well to reduce dTLB pressure. However, iTLB pressure is still high due to executable mappings having a low likelihood of being in the required folio size and mapping alignment, even when the filesystem supports readahead into large folios (e.g. XFS).

The reason for the low likelihood is that the current readahead algorithm starts with an order-0 folio and increases the folio order by 2 every time the readahead mark is hit. But most executable memory tends to be accessed randomly and so the readahead mark is rarely hit and most executable folios remain order-0.

So let's special-case the read(ahead) logic for executable mappings. The trade-off is performance improvement (due to more efficient storage of the translations in iTLB) vs potential for making reclaim more difficult (due to the folios being larger so if a part of the folio is hot the whole thing is considered hot). But executable memory is a small portion of the overall system memory so I doubt this will even register from a reclaim perspective.

I've chosen 64K folio size for arm64 which benefits both the 4K and 16K base page size configs. Crucially the same amount of data is still read (usually 128K) so I'm not expecting any read amplification issues. I don't anticipate any write amplification because text is always RO.
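The align-down and VMA-confinement behaviour described above can be sketched in userspace C. Note this is a simplified illustration with hypothetical names and constants (`ARCH_EXEC_ORDER`, `exec_ra_window`), not the actual kernel implementation:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical constants: 4K base pages, arch requests order-4 (64K) folios. */
#define ARCH_EXEC_ORDER 4

struct ra_window {
	unsigned long start;	/* first page index to read */
	unsigned long count;	/* number of pages to read */
};

/*
 * Sketch of the special-cased exec readahead: align the faulting page
 * index down to the previous natural (folio-order) boundary rather than
 * centering on the fault, then clamp the window to the bounds of the VMA
 * so that file padding between sections is never pulled into a folio.
 */
static struct ra_window exec_ra_window(unsigned long fault_index,
				       unsigned long vma_start_index,
				       unsigned long vma_end_index, /* exclusive */
				       unsigned long nr_folios)
{
	unsigned long nr_pages = 1UL << ARCH_EXEC_ORDER;
	struct ra_window w;

	/* Align down to the previous natural boundary. */
	w.start = fault_index & ~(nr_pages - 1);

	/* Same total read size as before, e.g. 2 x 64K = 128K. */
	w.count = nr_folios * nr_pages;

	/* Confine the read to the VMA. */
	if (w.start < vma_start_index)
		w.start = vma_start_index;
	if (w.start + w.count > vma_end_index)
		w.count = vma_end_index - w.start;

	return w;
}
```

For example, a fault at page index 19 with 64K folios aligns the window down to index 16, and a VMA ending at index 40 truncates the window so the trailing padding is simply never read.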
Note that the text region of an ELF file could be populated into the page cache for other reasons than taking a fault in a mmapped area. The most common case is due to the loader read()ing the header, which can be shared with the beginning of text. So some text will still remain in small folios, but this simple, best-effort change provides good performance improvements as is.

Confine this special-case approach to the bounds of the VMA. This prevents wasting memory for any padding that might exist in the file between sections. Previously the padding would have been contained in order-0 folios and would be easy to reclaim. But now it would be part of a larger folio so more difficult to reclaim. Solve this by simply not reading it into memory in the first place.

Benchmarking
============

TODO: NUMBERS ARE FOR V3 OF SERIES. NEED TO RERUN FOR THIS VERSION.

The below shows nginx and redis benchmarks on Ampere Altra arm64 system. First, confirmation that this patch causes more text to be contained in 64K folios:

| File-backed folios     |   system boot   |      nginx      |      redis      |
| by size as percentage  |-----------------|-----------------|-----------------|
| of all mapped text mem | before | after  | before | after  | before | after  |
|========================|========|========|========|========|========|========|
| base-page-4kB          |    26% |     9% |    27% |     6% |    21% |     5% |
| thp-aligned-8kB        |     4% |     2% |     3% |     0% |     4% |     1% |
| thp-aligned-16kB       |    57% |    21% |    57% |     6% |    54% |    10% |
| thp-aligned-32kB       |     4% |     1% |     4% |     1% |     3% |     1% |
| thp-aligned-64kB       |     7% |    65% |     8% |    85% |     9% |    72% |
| thp-aligned-2048kB     |     0% |     0% |     0% |     0% |     7% |     8% |
| thp-unaligned-16kB     |     1% |     1% |     1% |     1% |     1% |     1% |
| thp-unaligned-32kB     |     0% |     0% |     0% |     0% |     0% |     0% |
| thp-unaligned-64kB     |     0% |     0% |     0% |     1% |     0% |     1% |
| thp-partial            |     1% |     1% |     0% |     0% |     1% |     1% |
|------------------------|--------|--------|--------|--------|--------|--------|
| cont-aligned-64kB      |     7% |    65% |     8% |    85% |    16% |    80% |

The above shows that for both workloads (each isolated with cgroups) as well as the general system state after boot, the amount of text backed by 4K and 16K folios reduces and the amount backed by 64K folios increases significantly. And the amount of text that is contpte-mapped significantly increases (see last row).

And this is reflected in performance improvement:

| Benchmark                                     | Improvement          |
+===============================================+======================+
| pts/nginx (200 connections)                   |                8.96% |
| pts/nginx (1000 connections)                  |                6.80% |
+-----------------------------------------------+----------------------+
| pts/redis (LPOP, 50 connections)              |                5.07% |
| pts/redis (LPUSH, 50 connections)             |                3.68% |

Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>