Documentation: kvm: organize capabilities in the right section

Categorize the capabilities correctly.  Section 6 is for enabled vCPU
capabilities; section 7 is for enabled VM capabilities; section 8 is
for informational ones.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini
2025-04-01 14:57:39 +02:00
parent ed7974fd59
commit af339282e2


@@ -7447,6 +7447,75 @@ Unused bitfields in the bitarrays must be set to zero.
This capability connects the vcpu to an in-kernel XIVE device.
6.76 KVM_CAP_HYPERV_SYNIC
-------------------------
:Architectures: x86
:Target: vcpu
This capability, if KVM_CHECK_EXTENSION indicates that it is
available, means that the kernel has an implementation of the
Hyper-V Synthetic interrupt controller (SynIC). Hyper-V SynIC is
used to support Windows Hyper-V based guest paravirt drivers (VMBus).
In order to use SynIC, it has to be activated by setting this
capability via KVM_ENABLE_CAP ioctl on the vcpu fd. Note that this
will disable the use of APIC hardware virtualization even if supported
by the CPU, as it's incompatible with SynIC auto-EOI behavior.
6.77 KVM_CAP_HYPERV_SYNIC2
--------------------------
:Architectures: x86
:Target: vcpu
This capability enables a newer version of Hyper-V Synthetic interrupt
controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM
doesn't clear SynIC message and event flags pages when they are enabled by
writing to the respective MSRs.
6.78 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
-----------------------------------
:Architectures: x86
:Target: vcpu
This capability indicates that KVM running on top of the Hyper-V hypervisor
enables Direct TLB flush for its guests, meaning that TLB flush
hypercalls are handled by the Level 0 hypervisor (Hyper-V), bypassing KVM.
Due to the different ABI for hypercall parameters between Hyper-V and
KVM, enabling this capability effectively disables all hypercall
handling by KVM (as some KVM hypercalls may be mistakenly treated as TLB
flush hypercalls by Hyper-V), so userspace should disable KVM identification
in CPUID and only expose Hyper-V identification. In this case, the guest
thinks it's running on Hyper-V and only uses Hyper-V hypercalls.
6.79 KVM_CAP_HYPERV_ENFORCE_CPUID
---------------------------------
:Architectures: x86
:Target: vcpu
When enabled, KVM will disable emulated Hyper-V features provided to the
guest according to the bits in the Hyper-V CPUID feature leaves. Otherwise, all
currently implemented Hyper-V features are provided unconditionally when
Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001)
leaf.
6.80 KVM_CAP_ENFORCE_PV_FEATURE_CPUID
-------------------------------------
:Architectures: x86
:Target: vcpu
When enabled, KVM will disable paravirtual features provided to the
guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
(0x40000001). Otherwise, a guest may use the paravirtual features
regardless of what has actually been exposed through the CPUID leaf.
.. _KVM_CAP_DIRTY_LOG_RING:
.. _cap_enable_vm:
7. Capabilities that can be enabled on VMs
@@ -7963,23 +8032,6 @@ default.
See Documentation/arch/x86/sgx.rst for more details.
7.26 KVM_CAP_PPC_RPT_INVALIDATE
-------------------------------
:Architectures: ppc
:Type: vm
This capability indicates that the kernel is capable of handling
H_RPT_INVALIDATE hcall.
In order to enable the use of H_RPT_INVALIDATE in the guest,
user space might have to advertise it for the guest. For example,
IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is
present in the "ibm,hypertas-functions" device-tree property.
This capability is enabled for hypervisors on platforms like POWER9
that support radix MMU.
7.27 KVM_CAP_EXIT_ON_EMULATION_FAILURE
--------------------------------------
@@ -8037,19 +8089,6 @@ indicated by the fd to the VM this is called on.
This is intended to support intra-host migration of VMs between userspace VMMs,
upgrading the VMM process without interrupting the guest.
7.30 KVM_CAP_PPC_AIL_MODE_3
-------------------------------
:Architectures: ppc
:Type: vm
This capability indicates that the kernel supports the mode 3 setting for the
"Address Translation Mode on Interrupt" aka "Alternate Interrupt Location"
resource that is controlled with the H_SET_MODE hypercall.
This capability allows a guest kernel to use a better-performance mode for
handling interrupts and system calls.
7.31 KVM_CAP_DISABLE_QUIRKS2
----------------------------
@@ -8207,27 +8246,6 @@ This capability is aimed to mitigate the threat that malicious VMs can
cause CPU stuck (due to event windows don't open up) and make the CPU
unavailable to host or other VMs.
7.34 KVM_CAP_MEMORY_FAULT_INFO
------------------------------
:Architectures: x86
:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
The presence of this capability indicates that KVM_RUN will fill
kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
there is a valid memslot but no backing VMA for the corresponding host virtual
address.
The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
to KVM_EXIT_MEMORY_FAULT.
Note: Userspaces which attempt to resolve memory faults so that they can retry
KVM_RUN are encouraged to guard against repeatedly receiving the same
error/annotated fault.
See KVM_EXIT_MEMORY_FAULT for more information.
7.35 KVM_CAP_X86_APIC_BUS_CYCLES_NS
-----------------------------------
@@ -8245,19 +8263,220 @@ by KVM_CHECK_EXTENSION.
Note: Userspace is responsible for correctly configuring CPUID 0x15, a.k.a. the
core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest.
7.36 KVM_CAP_X86_GUEST_MODE
------------------------------
7.36 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
----------------------------------------------------------
:Architectures: x86, arm64
:Type: vm
:Parameters: args[0] - size of the dirty log ring
KVM is capable of tracking dirty memory using ring buffers that are
mmapped into userspace; there is one dirty ring per vcpu.
The dirty ring is available to userspace as an array of
``struct kvm_dirty_gfn``. Each dirty entry is defined as::
  struct kvm_dirty_gfn {
          __u32 flags;
          __u32 slot; /* as_id | slot_id */
          __u64 offset;
  };
The following values are defined for the flags field to define the
current state of the entry::
  #define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
  #define KVM_DIRTY_GFN_F_RESET           BIT(1)
  #define KVM_DIRTY_GFN_F_MASK            0x3
Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM
ioctl to enable this capability for the new guest and set the size of
the rings. Enabling the capability is only allowed before creating any
vCPU, and the size of the ring must be a power of two. The larger the
ring buffer, the less likely the ring is full and the VM is forced to
exit to userspace. The optimal size depends on the workload, but it is
recommended that it be at least 64 KiB (4096 entries).
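The sizing rule above can be checked with a few lines of C. This is a minimal sketch, not the kernel's own validation: it assumes args[0] is the ring size in bytes, and it uses a local mirror of the entry layout (``struct dirty_gfn``, a hypothetical name) so that it builds without ``<linux/kvm.h>``. Both helper names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of struct kvm_dirty_gfn as shown above (16 bytes). */
struct dirty_gfn {
    uint32_t flags;
    uint32_t slot;   /* as_id | slot_id */
    uint64_t offset;
};

/* Hypothetical helper: true if size_bytes is a non-zero power of two. */
bool ring_size_is_pow2(uint32_t size_bytes)
{
    return size_bytes != 0 && (size_bytes & (size_bytes - 1)) == 0;
}

/* Hypothetical helper: number of entries held by a ring of size_bytes. */
uint32_t ring_entries(uint32_t size_bytes)
{
    assert(ring_size_is_pow2(size_bytes));
    return size_bytes / (uint32_t)sizeof(struct dirty_gfn);
}
```

With 16-byte entries, the recommended minimum of 64 KiB corresponds to the 4096 entries quoted above.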
Just like for dirty page bitmaps, the buffer tracks writes to
all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
with the flag set, userspace can start harvesting dirty pages from the
ring buffer.
An entry in the ring buffer can be unused (flag bits ``00``),
dirty (flag bits ``01``) or harvested (flag bits ``1X``). The
state machine for the entry is as follows::
       dirtied         harvested        reset
  00 -----------> 01 -------------> 1X -------+
   ^                                          |
   |                                          |
   +------------------------------------------+
To harvest the dirty pages, userspace accesses the mmapped ring buffer
to read the dirty GFNs. If the flags field has the DIRTY bit set (at this stage
the RESET bit must be cleared), then it means this GFN is a dirty GFN.
The userspace should harvest this GFN and mark the flags from state
``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set
to show that this GFN is harvested and waiting for a reset), and move
on to the next GFN. The userspace should continue to do this until the
flags of a GFN have the DIRTY bit cleared, meaning that it has harvested
all the dirty GFNs that were available.
Note that on weakly ordered architectures, userspace accesses to the
ring buffer (and more specifically the 'flags' field) must be ordered,
using load-acquire/store-release accessors when available, or any
other memory barrier that will ensure this ordering.
It's not necessary for userspace to harvest all the dirty GFNs at once.
However, it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.
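The harvesting rules above (in-order collection, load-acquire on the flags, store-release when marking an entry harvested) might be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: ``struct dirty_gfn``, ``struct harvested_gfn``, ``harvest_ring`` and the ``DIRTY_GFN_F_*`` macros are local, hypothetical stand-ins, and the flags field is declared ``_Atomic`` for brevity where a real VMM would apply atomic accessors to the mmapped ``struct kvm_dirty_gfn``.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DIRTY_GFN_F_DIRTY (1u << 0)  /* mirrors KVM_DIRTY_GFN_F_DIRTY */
#define DIRTY_GFN_F_RESET (1u << 1)  /* mirrors KVM_DIRTY_GFN_F_RESET */
#define DIRTY_GFN_F_MASK  0x3u       /* mirrors KVM_DIRTY_GFN_F_MASK */

/* Local stand-in for one ring entry. */
struct dirty_gfn {
    _Atomic uint32_t flags;
    uint32_t slot;
    uint64_t offset;
};

/* Copy of the payload of a harvested entry. */
struct harvested_gfn {
    uint32_t slot;
    uint64_t offset;
};

/*
 * Harvest entries in order, starting at *next, until an entry that is
 * not in the dirty state (01) is found or out[] is full. nr_entries
 * must be a power of two. Returns the number of entries harvested.
 */
size_t harvest_ring(struct dirty_gfn *ring, size_t nr_entries,
                    size_t *next, struct harvested_gfn *out, size_t max_out)
{
    size_t n = 0;

    assert(nr_entries != 0 && (nr_entries & (nr_entries - 1)) == 0);

    while (n < max_out && n < nr_entries) {
        struct dirty_gfn *e = &ring[*next & (nr_entries - 1)];
        /* Acquire: read slot/offset only after seeing the DIRTY state. */
        uint32_t flags = atomic_load_explicit(&e->flags, memory_order_acquire);

        if ((flags & DIRTY_GFN_F_MASK) != DIRTY_GFN_F_DIRTY)
            break;  /* no more dirty entries available right now */

        out[n].slot = e->slot;
        out[n].offset = e->offset;
        n++;

        /* Release: publish the harvested state (1X) after the reads. */
        atomic_store_explicit(&e->flags, DIRTY_GFN_F_RESET, memory_order_release);
        (*next)++;
    }
    return n;
}
```

The caller keeps ``*next`` across calls so that no dirty entry is skipped, and would follow a successful harvest with KVM_RESET_DIRTY_RINGS.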
After processing one or more entries in the ring buffer, userspace
calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about
it, so that the kernel will reprotect those collected GFNs.
Therefore, the ioctl must be called *before* reading the content of
the dirty pages.
The dirty ring can get full. When it happens, the KVM_RUN of the
vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL.
The dirty ring interface has a major difference compared to the
KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from
userspace, it's still possible that the kernel has not yet flushed the
processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the
flushing is done by the KVM_GET_DIRTY_LOG ioctl). To force that flush, one
needs to kick the vcpu out of KVM_RUN using a signal. The resulting
vmexit ensures that all dirty GFNs are flushed to the dirty rings.
NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
should be exposed by weakly ordered architectures, in order to indicate
the additional memory ordering requirements imposed on userspace when
reading the state of an entry and mutating it from DIRTY to HARVESTED.
Architectures with TSO-like ordering (such as x86) are allowed to
expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
to userspace.
After enabling the dirty rings, userspace needs to check for the
KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP capability to see whether the
ring structures can be backed by per-slot bitmaps. When this capability
is advertised, the architecture can dirty guest pages without
vcpu/ring context, so that some of the dirty information will still be
maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
can't be enabled if KVM_CAP_DIRTY_LOG_RING_ACQ_REL
hasn't been enabled, or if any memslot already exists.
Note that the bitmap here is only a backup of the ring structure. The
use of the ring and bitmap combination is only beneficial if there is
only a very small amount of memory that is dirtied out of vcpu/ring
context. Otherwise, the stand-alone per-slot bitmap mechanism needs to
be considered.
To collect dirty bits in the backup bitmap, userspace can use the same
KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all
the generation of the dirty bits is done in a single pass. Collecting
the dirty bitmap should be the very last thing that the VMM does before
considering the state as complete. VMM needs to ensure that the dirty
state is final and avoid missing dirty pages from another ioctl ordered
after the bitmap collection.
NOTE: Multiple examples of using the backup bitmap: (1) save vgic/its
tables through command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} on
KVM device "kvm-arm-vgic-its". (2) restore vgic/its tables through
command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES} on KVM device
"kvm-arm-vgic-its". VGICv3 LPI pending status is restored. (3) save
vgic3 pending table through KVM_DEV_ARM_VGIC_{GRP_CTRL, SAVE_PENDING_TABLES}
command on KVM device "kvm-arm-vgic-v3".
7.37 KVM_CAP_PMU_CAPABILITY
---------------------------
:Architectures: x86
:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
:Type: vm
:Parameters: arg[0] is bitmask of PMU virtualization capabilities.
:Returns: 0 on success, -EINVAL when arg[0] contains invalid bits
The presence of this capability indicates that KVM_RUN will update the
KVM_RUN_X86_GUEST_MODE bit in kvm_run.flags to indicate whether the
vCPU was executing nested guest code when it exited.
This capability alters PMU virtualization in KVM.
KVM exits with the register state of either the L1 or L2 guest
depending on which executed at the time of an exit. Userspace must
take care to differentiate between these cases.
Calling KVM_CHECK_EXTENSION for this capability returns a bitmask of
PMU virtualization capabilities that can be adjusted on a VM.
The argument to KVM_ENABLE_CAP is also a bitmask and selects specific
PMU virtualization capabilities to be applied to the VM. This can
only be invoked on a VM prior to the creation of VCPUs.
At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting
this capability will disable PMU virtualization for that VM. Usermode
should adjust CPUID leaf 0xA to reflect that the PMU is disabled.
7.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
-------------------------------------
:Architectures: x86
:Type: vm
:Parameters: arg[0] must be 0.
:Returns: 0 on success, -EPERM if the userspace process does not
have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been
created.
This capability disables the NX huge pages mitigation for iTLB MULTIHIT.
The capability has no effect if the nx_huge_pages module parameter is not set.
This capability may only be set before any vCPUs are created.
7.39 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
---------------------------------------
:Architectures: arm64
:Type: vm
:Parameters: arg[0] is the new split chunk size.
:Returns: 0 on success, -EINVAL if any memslot was already created.
This capability sets the chunk size used in Eager Page Splitting.
Eager Page Splitting improves the performance of dirty-logging (used
in live migrations) when guest memory is backed by huge-pages. It
avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing
it eagerly when enabling dirty logging (with the
KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using
KVM_CLEAR_DIRTY_LOG.
The chunk size specifies how many pages to break at a time, using a
single allocation for each chunk. The bigger the chunk size, the more pages
need to be allocated ahead of time.
The chunk size needs to be a valid block size. The list of acceptable
block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
64-bit bitmap (each bit describing a block size). The default value is
0, to disable the eager page splitting.
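As an illustration of the validity rule above, userspace could pre-check a candidate chunk size against the bitmap before calling KVM_ENABLE_CAP. The sketch below assumes that bit N of the KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES bitmap stands for a block size of 2^N bytes; that encoding is an assumption here, and ``split_chunk_size_ok`` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if chunk_size is 0 (eager splitting
 * disabled) or a single block size present in supported_block_sizes.
 * Assumes bit N of the bitmap means "2^N bytes is supported".
 */
bool split_chunk_size_ok(uint64_t chunk_size, uint64_t supported_block_sizes)
{
    if (chunk_size == 0)
        return true;    /* 0 disables eager page splitting */
    if (chunk_size & (chunk_size - 1))
        return false;   /* more than one bit set: not a single block size */
    return (chunk_size & supported_block_sizes) != 0;
}
```

For example, with a bitmap advertising 4 KiB, 2 MiB and 1 GiB blocks, a 2 MiB chunk size passes the check while 64 KiB does not.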
7.40 KVM_CAP_EXIT_HYPERCALL
---------------------------
:Architectures: x86
:Type: vm
This capability, if enabled, will cause KVM to exit to userspace
with KVM_EXIT_HYPERCALL exit reason to process some hypercalls.
Calling KVM_CHECK_EXTENSION for this capability will return a bitmask
of hypercalls that can be configured to exit to userspace.
Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE.
The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace
the hypercalls whose corresponding bit is in the argument, and return
ENOSYS for the others.
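The subset requirement on the KVM_ENABLE_CAP argument amounts to a single mask test. A minimal sketch, with ``exit_hypercall_mask_ok`` as a hypothetical helper and ``supported`` standing for the value returned by KVM_CHECK_EXTENSION:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if every bit set in requested is also set
 * in supported, i.e. requested is a subset of supported.
 */
bool exit_hypercall_mask_ok(uint64_t requested, uint64_t supported)
{
    return (requested & ~supported) == 0;
}
```

An empty mask is trivially a subset, so it always passes this check.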
7.41 KVM_CAP_ARM_SYSTEM_SUSPEND
-------------------------------
:Architectures: arm64
:Type: vm
When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
7.37 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
-------------------------------------
@@ -8294,21 +8513,6 @@ H_RANDOM hypercall backed by a hardware random-number generator.
If present, the kernel H_RANDOM handler can be enabled for guest use
with the KVM_CAP_PPC_ENABLE_HCALL capability.
8.2 KVM_CAP_HYPERV_SYNIC
------------------------
:Architectures: x86
This capability, if KVM_CHECK_EXTENSION indicates that it is
available, means that the kernel has an implementation of the
Hyper-V Synthetic interrupt controller (SynIC). Hyper-V SynIC is
used to support Windows Hyper-V based guest paravirt drivers (VMBus).
In order to use SynIC, it has to be activated by setting this
capability via KVM_ENABLE_CAP ioctl on the vcpu fd. Note that this
will disable the use of APIC hardware virtualization even if supported
by the CPU, as it's incompatible with SynIC auto-EOI behavior.
8.3 KVM_CAP_PPC_MMU_RADIX
-------------------------
@@ -8454,16 +8658,6 @@ virtual SMT modes that can be set using KVM_CAP_PPC_SMT. If bit N
(counting from the right) is set, then a virtual SMT mode of 2^N is
available.
8.11 KVM_CAP_HYPERV_SYNIC2
--------------------------
:Architectures: x86
This capability enables a newer version of Hyper-V Synthetic interrupt
controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM
doesn't clear SynIC message and event flags pages when they are enabled by
writing to the respective MSRs.
8.12 KVM_CAP_HYPERV_VP_INDEX
----------------------------
@@ -8478,7 +8672,6 @@ capability is absent, userspace can still query this msr's value.
-------------------------------
:Architectures: s390
:Parameters: none
This capability indicates if the flic device will be able to get/set the
AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows
@@ -8552,21 +8745,6 @@ This capability indicates that KVM supports paravirtualized Hyper-V IPI send
hypercalls:
HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
-----------------------------------
:Architectures: x86
This capability indicates that KVM running on top of the Hyper-V hypervisor
enables Direct TLB flush for its guests, meaning that TLB flush
hypercalls are handled by the Level 0 hypervisor (Hyper-V), bypassing KVM.
Due to the different ABI for hypercall parameters between Hyper-V and
KVM, enabling this capability effectively disables all hypercall
handling by KVM (as some KVM hypercalls may be mistakenly treated as TLB
flush hypercalls by Hyper-V), so userspace should disable KVM identification
in CPUID and only expose Hyper-V identification. In this case, the guest
thinks it's running on Hyper-V and only uses Hyper-V hypercalls.
8.22 KVM_CAP_S390_VCPU_RESETS
-----------------------------
@@ -8644,142 +8822,6 @@ In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
trap and emulate MSRs that are outside of the scope of KVM as well as
limit the attack surface on KVM's MSR emulation code.
8.28 KVM_CAP_ENFORCE_PV_FEATURE_CPUID
-------------------------------------
:Architectures: x86
When enabled, KVM will disable paravirtual features provided to the
guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
(0x40000001). Otherwise, a guest may use the paravirtual features
regardless of what has actually been exposed through the CPUID leaf.
.. _KVM_CAP_DIRTY_LOG_RING:
8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
----------------------------------------------------------
:Architectures: x86, arm64
:Parameters: args[0] - size of the dirty log ring
KVM is capable of tracking dirty memory using ring buffers that are
mmapped into userspace; there is one dirty ring per vcpu.
The dirty ring is available to userspace as an array of
``struct kvm_dirty_gfn``. Each dirty entry is defined as::
  struct kvm_dirty_gfn {
          __u32 flags;
          __u32 slot; /* as_id | slot_id */
          __u64 offset;
  };
The following values are defined for the flags field to define the
current state of the entry::
  #define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
  #define KVM_DIRTY_GFN_F_RESET           BIT(1)
  #define KVM_DIRTY_GFN_F_MASK            0x3
Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM
ioctl to enable this capability for the new guest and set the size of
the rings. Enabling the capability is only allowed before creating any
vCPU, and the size of the ring must be a power of two. The larger the
ring buffer, the less likely the ring is full and the VM is forced to
exit to userspace. The optimal size depends on the workload, but it is
recommended that it be at least 64 KiB (4096 entries).
Just like for dirty page bitmaps, the buffer tracks writes to
all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
with the flag set, userspace can start harvesting dirty pages from the
ring buffer.
An entry in the ring buffer can be unused (flag bits ``00``),
dirty (flag bits ``01``) or harvested (flag bits ``1X``). The
state machine for the entry is as follows::
       dirtied         harvested        reset
  00 -----------> 01 -------------> 1X -------+
   ^                                          |
   |                                          |
   +------------------------------------------+
To harvest the dirty pages, userspace accesses the mmapped ring buffer
to read the dirty GFNs. If the flags field has the DIRTY bit set (at this stage
the RESET bit must be cleared), then it means this GFN is a dirty GFN.
The userspace should harvest this GFN and mark the flags from state
``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set
to show that this GFN is harvested and waiting for a reset), and move
on to the next GFN. The userspace should continue to do this until the
flags of a GFN have the DIRTY bit cleared, meaning that it has harvested
all the dirty GFNs that were available.
Note that on weakly ordered architectures, userspace accesses to the
ring buffer (and more specifically the 'flags' field) must be ordered,
using load-acquire/store-release accessors when available, or any
other memory barrier that will ensure this ordering.
It's not necessary for userspace to harvest all the dirty GFNs at once.
However, it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.
After processing one or more entries in the ring buffer, userspace
calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about
it, so that the kernel will reprotect those collected GFNs.
Therefore, the ioctl must be called *before* reading the content of
the dirty pages.
The dirty ring can get full. When it happens, the KVM_RUN of the
vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL.
The dirty ring interface has a major difference compared to the
KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from
userspace, it's still possible that the kernel has not yet flushed the
processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the
flushing is done by the KVM_GET_DIRTY_LOG ioctl). To force that flush, one
needs to kick the vcpu out of KVM_RUN using a signal. The resulting
vmexit ensures that all dirty GFNs are flushed to the dirty rings.
NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
should be exposed by weakly ordered architectures, in order to indicate
the additional memory ordering requirements imposed on userspace when
reading the state of an entry and mutating it from DIRTY to HARVESTED.
Architectures with TSO-like ordering (such as x86) are allowed to
expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
to userspace.
After enabling the dirty rings, userspace needs to check for the
KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP capability to see whether the
ring structures can be backed by per-slot bitmaps. When this capability
is advertised, the architecture can dirty guest pages without
vcpu/ring context, so that some of the dirty information will still be
maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
can't be enabled if KVM_CAP_DIRTY_LOG_RING_ACQ_REL
hasn't been enabled, or if any memslot already exists.
Note that the bitmap here is only a backup of the ring structure. The
use of the ring and bitmap combination is only beneficial if there is
only a very small amount of memory that is dirtied out of vcpu/ring
context. Otherwise, the stand-alone per-slot bitmap mechanism needs to
be considered.
To collect dirty bits in the backup bitmap, userspace can use the same
KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all
the generation of the dirty bits is done in a single pass. Collecting
the dirty bitmap should be the very last thing that the VMM does before
considering the state as complete. VMM needs to ensure that the dirty
state is final and avoid missing dirty pages from another ioctl ordered
after the bitmap collection.
NOTE: Multiple examples of using the backup bitmap: (1) save vgic/its
tables through command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} on
KVM device "kvm-arm-vgic-its". (2) restore vgic/its tables through
command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES} on KVM device
"kvm-arm-vgic-its". VGICv3 LPI pending status is restored. (3) save
vgic3 pending table through KVM_DEV_ARM_VGIC_{GRP_CTRL, SAVE_PENDING_TABLES}
command on KVM device "kvm-arm-vgic-v3".
8.30 KVM_CAP_XEN_HVM
--------------------
@@ -8878,65 +8920,6 @@ This capability indicates that the KVM virtual PTP service is
supported in the host. A VMM can check whether the service is
available to the guest on migration.
8.33 KVM_CAP_HYPERV_ENFORCE_CPUID
---------------------------------
:Architectures: x86
When enabled, KVM will disable emulated Hyper-V features provided to the
guest according to the bits in the Hyper-V CPUID feature leaves. Otherwise, all
currently implemented Hyper-V features are provided unconditionally when
Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001)
leaf.
8.34 KVM_CAP_EXIT_HYPERCALL
---------------------------
:Architectures: x86
:Type: vm
This capability, if enabled, will cause KVM to exit to userspace
with KVM_EXIT_HYPERCALL exit reason to process some hypercalls.
Calling KVM_CHECK_EXTENSION for this capability will return a bitmask
of hypercalls that can be configured to exit to userspace.
Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE.
The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace
the hypercalls whose corresponding bit is in the argument, and return
ENOSYS for the others.
8.35 KVM_CAP_PMU_CAPABILITY
---------------------------
:Architectures: x86
:Type: vm
:Parameters: arg[0] is bitmask of PMU virtualization capabilities.
:Returns: 0 on success, -EINVAL when arg[0] contains invalid bits
This capability alters PMU virtualization in KVM.
Calling KVM_CHECK_EXTENSION for this capability returns a bitmask of
PMU virtualization capabilities that can be adjusted on a VM.
The argument to KVM_ENABLE_CAP is also a bitmask and selects specific
PMU virtualization capabilities to be applied to the VM. This can
only be invoked on a VM prior to the creation of VCPUs.
At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting
this capability will disable PMU virtualization for that VM. Usermode
should adjust CPUID leaf 0xA to reflect that the PMU is disabled.
8.36 KVM_CAP_ARM_SYSTEM_SUSPEND
-------------------------------
:Architectures: arm64
:Type: vm
When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
8.37 KVM_CAP_S390_PROTECTED_DUMP
--------------------------------
@@ -8949,22 +8932,6 @@ PV guests. The `KVM_PV_DUMP` command is available for the
dump related UV data. Also the vcpu ioctl `KVM_S390_PV_CPU_COMMAND` is
available and supports the `KVM_PV_DUMP_CPU` subcommand.
8.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
-------------------------------------
:Architectures: x86
:Type: vm
:Parameters: arg[0] must be 0.
:Returns: 0 on success, -EPERM if the userspace process does not
have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been
created.
This capability disables the NX huge pages mitigation for iTLB MULTIHIT.
The capability has no effect if the nx_huge_pages module parameter is not set.
This capability may only be set before any vCPUs are created.
8.39 KVM_CAP_S390_CPU_TOPOLOGY
------------------------------
@@ -8989,32 +8956,6 @@ structure.
When getting the Modified Change Topology Report value, the attr->addr
must point to a byte where the value will be stored or retrieved from.
8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
---------------------------------------
:Architectures: arm64
:Type: vm
:Parameters: arg[0] is the new split chunk size.
:Returns: 0 on success, -EINVAL if any memslot was already created.
This capability sets the chunk size used in Eager Page Splitting.
Eager Page Splitting improves the performance of dirty-logging (used
in live migrations) when guest memory is backed by huge-pages. It
avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing
it eagerly when enabling dirty logging (with the
KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using
KVM_CLEAR_DIRTY_LOG.
The chunk size specifies how many pages to break at a time, using a
single allocation for each chunk. The bigger the chunk size, the more pages
need to be allocated ahead of time.
The chunk size needs to be a valid block size. The list of acceptable
block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
64-bit bitmap (each bit describing a block size). The default value is
0, to disable the eager page splitting.
8.41 KVM_CAP_VM_TYPES
---------------------
@@ -9034,6 +8975,67 @@ Do not use KVM_X86_SW_PROTECTED_VM for "real" VMs, and especially not in
production. The behavior and effective ABI for software-protected VMs is
unstable.
8.42 KVM_CAP_PPC_RPT_INVALIDATE
-------------------------------
:Architectures: ppc
This capability indicates that the kernel is capable of handling
H_RPT_INVALIDATE hcall.
In order to enable the use of H_RPT_INVALIDATE in the guest,
user space might have to advertise it for the guest. For example,
IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is
present in the "ibm,hypertas-functions" device-tree property.
This capability is enabled for hypervisors on platforms like POWER9
that support radix MMU.
8.43 KVM_CAP_PPC_AIL_MODE_3
---------------------------
:Architectures: ppc
This capability indicates that the kernel supports the mode 3 setting for the
"Address Translation Mode on Interrupt" aka "Alternate Interrupt Location"
resource that is controlled with the H_SET_MODE hypercall.
This capability allows a guest kernel to use a better-performance mode for
handling interrupts and system calls.
8.44 KVM_CAP_MEMORY_FAULT_INFO
------------------------------
:Architectures: x86
The presence of this capability indicates that KVM_RUN will fill
kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
there is a valid memslot but no backing VMA for the corresponding host virtual
address.
The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
to KVM_EXIT_MEMORY_FAULT.
Note: Userspaces which attempt to resolve memory faults so that they can retry
KVM_RUN are encouraged to guard against repeatedly receiving the same
error/annotated fault.
See KVM_EXIT_MEMORY_FAULT for more information.
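The validity condition for kvm_run.memory_fault can be expressed as a small predicate. This is a sketch under assumptions: ``memory_fault_info_valid`` is a hypothetical helper, and the fallback definitions exist only so that the snippet builds without ``<linux/kvm.h>`` or on a non-Linux host; real code should take both constants from the system headers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef KVM_EXIT_MEMORY_FAULT
#define KVM_EXIT_MEMORY_FAULT 39  /* assumed value; use <linux/kvm.h> in real code */
#endif
#ifndef EHWPOISON
#define EHWPOISON 133             /* fallback for non-Linux build hosts only */
#endif

/*
 * Hypothetical helper: true if kvm_run.memory_fault may be consulted.
 * run_ret is the KVM_RUN ioctl return value, run_errno the errno seen
 * right after it, exit_reason the value of kvm_run.exit_reason.
 */
bool memory_fault_info_valid(int run_ret, int run_errno, uint32_t exit_reason)
{
    if (run_ret >= 0)
        return false;   /* KVM_RUN did not return an error */
    if (run_errno != EFAULT && run_errno != EHWPOISON)
        return false;   /* some other failure, e.g. EINTR */
    return exit_reason == KVM_EXIT_MEMORY_FAULT;
}
```

A VMM that retries KVM_RUN after resolving the fault could pair this check with a counter to guard against receiving the same annotated fault repeatedly.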
8.45 KVM_CAP_X86_GUEST_MODE
---------------------------
:Architectures: x86
The presence of this capability indicates that KVM_RUN will update the
KVM_RUN_X86_GUEST_MODE bit in kvm_run.flags to indicate whether the
vCPU was executing nested guest code when it exited.
KVM exits with the register state of either the L1 or L2 guest
depending on which executed at the time of an exit. Userspace must
take care to differentiate between these cases.
9. Known KVM API problems
=========================