KVM: x86/mmu: Mark folio dirty when creating SPTE, not when zapping/modifying
Mark pages/folios dirty when creating SPTEs to map PFNs into the guest, not when zapping or modifying SPTEs, as marking folios dirty when zapping or modifying SPTEs can be extremely inefficient. E.g. when KVM is zapping collapsible SPTEs to reconstitute a hugepage after disabling dirty logging, KVM will mark every 4KiB pfn as dirty, even though _at least_ 512 pfns are guaranteed to be in a single folio (the SPTE couldn't potentially be huge if that weren't the case). The problem only becomes worse for 1GiB HugeTLB pages, as KVM can mark a single folio dirty 512*512 times.

Marking a folio dirty when mapping is functionally safe as KVM drops all relevant SPTEs in response to an mmu_notifier invalidation, i.e. ensures that the guest can't dirty a folio after access has been removed.

And because KVM already marks folios dirty when zapping/modifying SPTEs for KVM reasons, i.e. not in response to an mmu_notifier invalidation, there is no danger of "prematurely" marking a folio dirty. E.g. if a filesystem cleans a folio without first removing write access, then there already exist races where KVM could mark a folio dirty before remote TLBs are flushed, i.e. before guest writes are guaranteed to stop.

Furthermore, x86 is literally the only architecture that marks folios dirty on the backend; every other KVM architecture marks folios dirty at map time.

x86's unique behavior likely stems from the fact that x86's MMU predates mmu_notifiers. Long, long ago, before mmu_notifiers were added, marking pages dirty when zapping SPTEs was logical, and perhaps even necessary, as KVM held references to pages, i.e. kept a page's refcount elevated while the page was mapped into the guest. At the time, KVM's rmap_remove() simply did:

    if (is_writeble_pte(*spte))
        kvm_release_pfn_dirty(pfn);
    else
        kvm_release_pfn_clean(pfn);

i.e. dropped the refcount and marked the page dirty at the same time. After mmu_notifiers were introduced, commit acb66dd0 ("KVM: MMU: don't hold pagecount reference for mapped sptes pages") removed the refcount logic, but kept the dirty logic, i.e. converted the above to:

    if (is_writeble_pte(*spte))
        kvm_release_pfn_dirty(pfn);

And for KVM x86, that's essentially how things have stayed over the last ~15 years, without anyone revisiting *why* KVM marks pages/folios dirty at zap/modification time, e.g. the behavior was blindly carried forward to the TDP MMU.

Practically speaking, the only downside to marking a folio dirty during mapping is that KVM could trigger writeback of memory that was never actually written. Except that can't actually happen if KVM marks folios dirty if and only if a writable SPTE is created (as done here), because KVM always marks writable SPTEs as dirty during make_spte(). See commit 9b51a630 ("KVM: MMU: Explicitly set D-bit for writable spte."), circa 2015.
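Roughly, as an illustrative sketch rather than the literal diff, the map-time behavior in make_spte() boils down to the ordering below. pte_access, pfn, and prefetch are make_spte() parameters, kvm_set_pfn_dirty() is the helper that existed at the time, and the real conditions and placement are more involved:

    /* Writable SPTEs are also made dirty (the commit 9b51a630 behavior). */
    if (pte_access & ACC_WRITE_MASK)
        spte |= PT_WRITABLE_MASK | spte_shadow_dirty_mask(spte);

    /*
     * Dirty the folio at map time, if and only if the new SPTE is
     * writable, and before the SPTE is possibly downgraded for access
     * tracking (see below).
     */
    if (spte & PT_WRITABLE_MASK)
        kvm_set_pfn_dirty(pfn);

    /* Prefetched SPTEs get marked for access tracking (illustrative). */
    if (prefetch)
        spte = mark_spte_for_access_track(spte);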
Note, KVM's access tracking logic for prefetched SPTEs is a bit odd. If a guest PTE is dirty and writable, KVM will create a writable SPTE, but then mark the SPTE for access tracking. Which isn't wrong, just a bit odd, as it results in _more_ precise dirty tracking for MMUs _without_ A/D bits.

To keep things simple, mark the folio dirty before access tracking comes into play, as an access-tracked SPTE can be restored in the fast page fault path, i.e. without holding mmu_lock. While writing SPTEs and accessing memslots outside of mmu_lock is safe, marking a folio dirty is not. E.g. if the fast path gets interrupted _just_ after setting a SPTE, the primary MMU could theoretically invalidate and free a folio before KVM marks it dirty. Unlike the shadow MMU, which waits for CPUs to respond to an IPI, the TDP MMU only guarantees the page tables themselves won't be freed (via RCU).

Opportunistically update a few stale comments.
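For contrast, the backend call that becomes unnecessary looks roughly like the below, condensed from a TDP-MMU-style changed-SPTE handler (the exact condition and local variable names are illustrative, not the verbatim upstream code). With folios dirtied at map time, this no longer needs to fire once per dirty 4KiB pfn when collapsible SPTEs are zapped:

    /* Old zap/modify-time behavior, now removed (illustrative). */
    if (was_leaf && is_dirty_spte(old_spte) &&
        (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
        kvm_set_pfn_dirty(spte_to_pfn(old_spte));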
Cc: David Matlack <dmatlack@google.com>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-9-seanjc@google.com>