This also changes the return values, since their meanings are rather
weird from the storage point of view. For instance, "internal" meant
that it is *not* the storage which does the snapshot, while "external"
meant a mixture of storage- and qemu-server-side actions. `undef` meant
the storage does it all...
┌────────────┬───────────┐
│ previous │ new │
├────────────┼───────────┤
│ "internal" │ "qemu" │
│ "external" │ "mixed" │
│ undef │ "storage" │
└────────────┴───────────┘
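For illustration, a minimal sketch of a plugin using the new values (the
method name and parameter list are assumptions here, not the exact API):

    sub volume_qemu_snapshot_method {
        my ($class, $storeid, $scfg, $volname) = @_;
        # 'qemu'    - qemu takes the snapshot itself (previously 'internal')
        # 'mixed'   - storage and qemu-server both act (previously 'external')
        # 'storage' - the storage layer does it all (previously undef)
        return 'storage';
    }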
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Just to be safe, as this is already checked higher up in the stack.
Technically, it's possible to implement snapshot file renaming
and to update the backing_file info with "qemu-img rebase -u".
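For example (file names are illustrative only), the backing file
reference of an already-renamed snapshot file could be updated in place
with:

    qemu-img rebase -u -F qcow2 -b snap_vm-100-disk-0_snap1.qcow2 vm-100-disk-0.qcow2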
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
both source and target snapshot need to be provided when renaming.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
by moving the preallocation handling to the call site, and preparing
them to take further options like cluster size in the future.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
we format the LVM logical volume with qcow2 to handle the snapshot chain.
Like for a qcow2 file, when a snapshot is taken, the current LVM volume
is renamed to the snapshot volname, and a new current LVM volume is
created with the snapshot volname as backing file.
The snapshot volname is similar to lvmthin: snap_${volname}_${snapname}.qcow2
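As an illustration (names are only examples), after taking snapshot
'snap1' of volume 'vm-100-disk-0' the chain looks roughly like:

    current volume (new LV)
      └─ backing file: snap_vm-100-disk-0_snap1.qcow2 (the renamed former current LV)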
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
add a snapext option to enable the feature
When a snapshot is taken, the current volume is renamed to the snapshot
volname and a new current image is created with the snapshot volume as
backing file.
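A hedged storage.cfg sketch (the storage ID and other lines are only
illustrative, the option name comes from this patch):

    dir: extsnap-dir
        path /var/lib/vz
        content images
        snapext 1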
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
Returns whether the volume supports qemu snapshots:
'internal' : do the snapshot with a qemu internal snapshot
'external' : do the snapshot with a qemu external snapshot
undef : qemu snapshots are not supported
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
This adds a $running param to volume_snapshot;
it can be used if some extra actions need to be done at the storage
layer when the snapshot has already been taken at the qemu level.
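A sketch of the extended call; the existing parameter order is assumed
from the current plugin API:

    sub volume_snapshot {
        my ($class, $scfg, $storeid, $volname, $snap, $running) = @_;
        # when $running is set, qemu has already taken the snapshot and only
        # extra storage-level actions (if any) are needed here
        # ... storage-side snapshot handling ...
        return;
    }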
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
template guests are never running and never write
to their disks/mountpoints, so the $running parameters there can be
dropped.
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
This computes the whole size of a qcow2 volume, data + metadata.
Needed for qcow2 over an LVM volume.
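For comparison (not necessarily how the helper computes it), 'qemu-img
measure' reports such a worst-case figure as "fully allocated size":

    qemu-img measure --size 100G -O qcow2 -o cluster_size=128k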
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
and use it for plugin linked clone
This also enables extended_l2=on, as it's mandatory for backing file
preallocation.
Preallocation was missing previously, so this should increase performance
for linked clones now (around 5x in randwrite 4k).
cluster_size is set to 128k, as it reduces qcow2 overhead (less disk
space, but also less memory needed to cache metadata).
l2_extended is not enabled yet on the base image, but it could also help
to reduce overhead without impacting performance.
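As an illustration of these creation options (file names are examples
only), a linked clone could be created roughly like this:

    qemu-img create -f qcow2 -b base-100-disk-0.qcow2 -F qcow2 \
        -o extended_l2=on,cluster_size=128k,preallocation=metadata \
        vm-101-disk-0.qcow2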
bench on 100G qcow2 file:
fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio --name=test
fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k --iodepth=32 --ioengine=libaio --name=test
base image:
randwrite 4k: prealloc=metadata, l2_extended=off, cluster_size=64k: 20215
randread 4k: prealloc=metadata, l2_extended=off, cluster_size=64k: 22219
randwrite 4k: prealloc=metadata, l2_extended=on, cluster_size=64k: 20217
randread 4k: prealloc=metadata, l2_extended=on, cluster_size=64k: 21742
randwrite 4k: prealloc=metadata, l2_extended=on, cluster_size=128k: 21599
randread 4k: prealloc=metadata, l2_extended=on, cluster_size=128k: 22037
clone image with backing file:
randwrite 4k: prealloc=metadata, l2_extended=off, cluster_size=64k: 3912
randread 4k: prealloc=metadata, l2_extended=off, cluster_size=64k: 21476
randwrite 4k: prealloc=metadata, l2_extended=on, cluster_size=64k: 20563
randread 4k: prealloc=metadata, l2_extended=on, cluster_size=64k: 22265
randwrite 4k: prealloc=metadata, l2_extended=on, cluster_size=128k: 18016
randread 4k: prealloc=metadata, l2_extended=on, cluster_size=128k: 21611
Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
In 7684225 ("ceph/rbd: set 'keyring' in ceph configuration for
externally managed RBD storages") the ceph config creation was packed
into a new function and checked whether the installation is an external
Ceph cluster or not.
However, a check was forgotten in the RBDPlugin which is now added.
Without this check a configuration in /etc/pve/priv/ceph/<pool>.conf is
created and pvestatd complains
pvestatd[1144]: ignoring custom ceph config for storage 'pool', 'monhost' is not set (assuming pveceph managed cluster)! because the file /etc/pve/priv/ceph/pool.conf
Fixes: 7684225 ("ceph/rbd: set 'keyring' in ceph configuration for externally managed RBD storages")
Signed-off-by: Hannes Duerr <h.duerr@proxmox.com>
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
Link: https://lore.proxmox.com/20250716130117.71785-1-h.duerr@proxmox.com
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This reduces the potential breakage from commit "fix #5071: zfs over
iscsi: add 'zfs-base-path' configuration option". Only setups where
'/dev/zvol' exists, but is not a valid base, will still be affected.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Christoph Heiss <c.heiss@proxmox.com>
Link: https://lore.proxmox.com/20250605111109.52712-2-f.ebner@proxmox.com
Use '/dev/zvol' as a base path for new storages for providers 'iet'
and 'LIO', because that is what modern distributions use.
This is a breaking change regarding the addition of new storages on
older distributions, but it's enough to specify the base path '/dev'
explicitly for setups that require it.
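A hedged storage.cfg sketch, with the other required options (portal,
pool, target, ...) omitted and only zfs-base-path taken from this patch:

    zfs: zfs-over-iscsi-example
        iscsiprovider LIO
        zfs-base-path /dev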
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Christoph Heiss <c.heiss@proxmox.com>
Link: https://lore.proxmox.com/20250605111109.52712-1-f.ebner@proxmox.com
When discovering a new volume group (VG), for example on boot, LVM
triggers autoactivation. With the default settings, this activates all
logical volumes (LVs) in the VG. Activating an LV creates a
device-mapper device and a block device under /dev/mapper.
Autoactivation is problematic for shared LVM storages, see #4997 [1].
For the inherently local LVM-thin storage it is less problematic, but
it still makes sense to avoid unnecessarily activating LVs and thus
making them visible on the host at boot.
To avoid that, disable autoactivation after creating new LVs. lvcreate
on trixie does not accept the --setautoactivation flag for thin LVs
yet; support was only added with [2]. Hence, setting the flag is
done with an additional lvchange command for now. With this setting,
LVM autoactivation will not activate these LVs, and the storage stack
will take care of activating/deactivating LVs when needed.
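For example (VG/LV names are illustrative only):

    lvchange --setautoactivation n pve/vm-100-disk-0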
The flag is only set for newly created LVs, so LVs created before this
patch can still trigger #4997. To avoid this, users will be advised to
run a script to disable autoactivation for existing LVs.
[1] https://bugzilla.proxmox.com/show_bug.cgi?id=4997
[2] 1fba3b876b
Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
Link: https://lore.proxmox.com/20250709141034.169726-3-f.weber@proxmox.com
When discovering a new volume group (VG), for example on boot, LVM
triggers autoactivation. With the default settings, this activates all
logical volumes (LVs) in the VG. Activating an LV creates a
device-mapper device and a block device under /dev/mapper.
This is not necessarily problematic for local LVM VGs, but it is
problematic for VGs on top of a shared LUN used by multiple cluster
nodes (accessed via e.g. iSCSI/Fibre Channel/direct-attached SAS).
Concretely, in a cluster with a shared LVM VG where an LV is active on
nodes 1 and 2, deleting the LV on node 1 will not clean up the
device-mapper device on node 2. If an LV with the same name is
recreated later, the leftover device-mapper device will cause
activation of that LV on node 2 to fail with:
> device-mapper: create ioctl on [...] failed: Device or resource busy
Hence, certain combinations of guest removal (and thus LV removals)
and node reboots can cause guest creation or VM live migration (which
both entail LV activation) to fail with the above error message for
certain VMIDs, see bug #4997 for more information [1].
To avoid this issue in the future, disable autoactivation when
creating new LVs using the `--setautoactivation` flag. With this
setting, LVM autoactivation will not activate these LVs, and the
storage stack will take care of activating/deactivating the LV (only)
on the correct node when needed.
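For example (VG/LV names and size are illustrative only):

    lvcreate --setautoactivation n -n vm-100-disk-0 -L 32G sharedvg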
This additionally fixes an issue with multipath on FC/SAS-attached
LUNs where LVs would be activated too early after boot, when multipath
is not yet available; see [3] for more details and the current workaround.
The `--setautoactivation` flag was introduced with LVM 2.03.12 [2], so
it is available since Bookworm/PVE 8, which ships 2.03.16. Nodes with
older LVM versions ignore the flag and remove it on metadata updates,
which is why PVE 8 could not use the flag reliably, since there may
still be PVE 7 nodes in the cluster that reset it on metadata updates.
The flag is only set for newly created LVs, so LVs created before this
patch can still trigger #4997. To avoid this, users will be advised to
run a script to disable autoactivation for existing LVs.
[1] https://bugzilla.proxmox.com/show_bug.cgi?id=4997
[2] https://gitlab.com/lvmteam/lvm2/-/blob/main/WHATS_NEW
[3] https://pve.proxmox.com/mediawiki/index.php?title=Multipath&oldid=12039#FC/SAS-specific_configuration
Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
Link: https://lore.proxmox.com/20250709141034.169726-2-f.weber@proxmox.com
Introduce qemu_blockdev_options() plugin method.
In terms of the plugin API only, adding the qemu_blockdev_options()
method is a fully backwards-compatible change. However, when qemu-server
switches to '-blockdev', plugins for which the default implementation
is not sufficient will not be usable for virtual machines anymore.
Therefore, this is intended for the next major release, Proxmox VE 9.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
FG: fixed typo, add paragraph break
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
for better backwards compatibility. This also means using path()
rather than filesystem_path() as the latter does not return protocol
paths.
Some protocol paths are not implemented (all protocols listed by
grepping for '\.protocol_name' in QEMU were considered):
- ftp(s)/http(s), which would access web servers via curl. This one
could be added if there is enough interest.
- nvme://XXXX:XX:XX.X/X, which would access a host NVME device.
- null-{aio,co}, which are mainly useful for debugging.
- pbs, because path-based access is not used anymore for PBS;
live-restore in qemu-server already defines a driver-based device.
- nfs and ssh, because the QEMU build script used by Proxmox VE does
not enable them.
- blk{debug,verify}, because they are for debugging.
- the ones used by blkio, i.e. io_uring, nvme-io_uring,
virtio-blk-vfio-pci, virtio-blk-vhost-user and
virtio-blk-vhost-vdpa, because the QEMU build script used by Proxmox
VE does not enable blkio.
- backup-dump and zeroinit, because they should not be used by the
storage layer directly.
- gluster, because support is dropped in Proxmox VE 9.
- host_cdrom, because the storage layer should not access host CD-ROM
devices.
- fat, because it hopefully isn't used by any third-party plugin here.
Co-developed-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Everything the default plugin method implementation can return is
allowed, so there is no breakage introduced by this patch.
By far the most common drivers will be 'file' and 'host_device', which
the default implementation of the plugin method currently uses. Other
quite common ones will be 'iscsi' and 'nbd'. There might also be
plugins with 'rbd' and it is planned to support QEMU protocol-paths in
the default plugin method implementation, where the 'rbd:' protocol
will also be supported.
Plugin authors are encouraged to request additional drivers and
options based on their needs on the pve-devel mailing list. The list
just starts out more restrictive, but anything for which there is no
good reason to disallow it could be allowed in the future upon request.
Suggested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Plugins can guard based on the machine version to be able to switch
drivers or options in a safe way without the risk of breaking older
versions.
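A rough sketch of such a guard (how the machine version is passed to the
plugin method, and its format, are assumptions here):

    my ($major, $minor) = $machine_version =~ m/^(\d+)\.(\d+)/;
    if ($major && $major >= 10) {
        # newer machine version: safe to switch to newer drivers/options
    } else {
        # older machine version: keep the previously used options
    }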
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
This is mostly in preparation for external qcow2 snapshot support.
For internal qcow2 snapshots, which currently are the only supported
variant, it is not possible to attach the snapshot only. If access to
that is required it will need to be handled differently, e.g. via a
FUSE/NBD export.
Such accesses are currently not done for running VMs via '-drive'
either, so there still is feature parity.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
For '-drive', qemu-server sets special cache options for an EFI disk
using RBD. In preparation to seamlessly switch to the new '-blockdev'
interface, do the same here. Note that the issue from bug #3329, which
is solved by these cache options, still affects current versions.
With -blockdev, the cache options are split up. While cache.direct and
cache.no-flush can be set in the -blockdev options, cache.writeback is
a front-end property and was intentionally removed from the -blockdev
options by QEMU commit aaa436f998 ("block: Remove cache.writeback from
blockdev-add"). It needs to be configured as the 'write-cache'
property for the ide-hd/scsi-hd/virtio-blk device.
The default is already 'writeback' and no cache mode can be set for an
EFI drive configuration in Proxmox VE currently, so there will not be
a clash.
┌─────────────┬─────────────────┬──────────────┬────────────────┐
│ │ cache.writeback │ cache.direct │ cache.no-flush │
├─────────────┼─────────────────┼──────────────┼────────────────┤
│writeback │ on │ off │ off │
├─────────────┼─────────────────┼──────────────┼────────────────┤
│none │ on │ on │ off │
├─────────────┼─────────────────┼──────────────┼────────────────┤
│writethrough │ off │ off │ off │
├─────────────┼─────────────────┼──────────────┼────────────────┤
│directsync │ off │ on │ off │
├─────────────┼─────────────────┼──────────────┼────────────────┤
│unsafe │ on │ off │ on │
└─────────────┴─────────────────┴──────────────┴────────────────┘
Table from 'man kvm'.
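For instance, the '-drive' cache mode 'none' roughly translates to the
following (node name and file path are illustrative, and the file driver
is used here just for the example):

    -blockdev driver=file,node-name=drive-efidisk0,filename=/tmp/efidisk.raw,cache.direct=on,cache.no-flush=off
    -device scsi-hd,drive=drive-efidisk0,write-cache=on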
Alternatively, the option could only be set once when allocating the
RBD volume. However, then we would need to detect all cases where a
volume could potentially be used as an EFI disk later. Having a custom
disk type would help a lot there. The approach here was chosen as it
is catch-all and should not be too costly either.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
For QEMU, when using '-blockdev', there is no way to specify the
keyring file like was possible with '-drive', so it has to be set in
the corresponding Ceph configuration file. As it applies to all images
on the storage, it also is the most natural place for the setting.
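An illustrative snippet for such a per-storage configuration file (the
exact paths depend on the storage ID):

    # /etc/pve/priv/ceph/<storeid>.conf
    [global]
        keyring = /etc/pve/priv/ceph/<storeid>.keyring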
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
ZFS does not have a filesystem_path() method, so the default
implementation for qemu_blockdev_options() cannot be re-used. This is
most likely because snapshots are currently not directly accessible
via a filesystem path in the Proxmox VE storage layer.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
This is in preparation to switch qemu-server from using '-drive' to
the modern '-blockdev' in the QEMU commandline options as well as for
the qemu-storage-daemon, which only supports '-blockdev'. The plugins
know best what driver and options are needed to access an image, so
a dedicated plugin method returning the necessary parameters for
'-blockdev' is the most straightforward approach.
There intentionally is only handling for absolute paths in the default
plugin implementation. Any plugin requiring more needs to implement
the method itself. With PVE 9 being a major release and most popular
plugins not using special protocols like 'rbd://', this seems
acceptable.
For NBD etc., qemu-server should construct the blockdev object.
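A rough sketch of the default behavior described above (the exact method
signature and returned hash layout are assumptions, not the final API):

    sub qemu_blockdev_options {
        my ($class, $scfg, $storeid, $volname) = @_;
        my ($path) = $class->filesystem_path($scfg, $volname);
        die "need an absolute path, got '$path'\n" if $path !~ m|^/|;
        # regular files use the 'file' driver, block devices 'host_device'
        my $driver = -b $path ? 'host_device' : 'file';
        return { driver => $driver, filename => $path };
    }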
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>