Storage Performance Testing Toolkit and Configuration #1457
-
@benyamin-codez Thanks for starting the thread! It would be good to reach some agreement on how we test the performance impact of proposed patches.
-
Thanks Yan. Is there a standardised internal IOMeter config you can share? If I'm not mistaken, it's probably a good idea to include a no-MSI-X case as well (along with a separate single-vCPU test) to cover all those corner cases. Single-vCPU testing aside, I've tended to settle on 4x vCPU.
-
@xiagao can you please share your IOMeter settings?
-
@benyamin-codez BTW: do you still see systems without MSI-X? Are you still supporting Windows Server 2008?
-
We use IOMeter for general IO testing with the following configuration.
For performance testing, we usually use the fio tool.
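To make the fio side concrete, here is a minimal job-file sketch that could serve as a starting point for a standardised config. The job name, drive letter, file name, and size are all illustrative placeholders, not an agreed configuration:

```ini
; Hypothetical baseline job -- values are illustrative, not agreed.
[global]
ioengine=windowsaio   ; libaio for Linux guests
direct=1              ; bypass the guest page cache
time_based=1
runtime=300
group_reporting=1

[randread-4k-qd32]
rw=randread
bs=4k
iodepth=32
numjobs=4             ; match to the vCPU count
filename=D\:\testfile.dat
size=16g
```

Standardising on a job file like this (rather than ad-hoc command lines) would also make it easy to diff configurations between test runs.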
-
I'm minded to also include in the discussion the backing used for performance testing. To date, I've used a raw, fully allocated, file-based backing serviced outside of the QEMU global mutex. Notable configuration elements include: I also:
Does anyone have any comments or concerns with any of the above?
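For reference, a backing like that can be sketched as below. The image path, size, iothread name, and the choice of virtio-blk are illustrative assumptions, not a prescribed setup:

```shell
# Fully allocated raw file-based backing (paths/sizes are placeholders).
qemu-img create -f raw -o preallocation=full /var/lib/images/perftest.img 64G

# Attach it via a dedicated iothread so I/O is serviced outside the
# QEMU global mutex (virtio-blk shown; virtio-scsi is analogous).
qemu-system-x86_64 \
  ... \
  -object iothread,id=iothread0 \
  -blockdev driver=file,filename=/var/lib/images/perftest.img,node-name=file0,cache.direct=on \
  -blockdev driver=raw,file=file0,node-name=disk0 \
  -device virtio-blk-pci,drive=disk0,iothread=iothread0
```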
-
I'd add: random, 8 KiB, QD64, 8/16 threads, 8 CPUs, to better mimic a highly stressed, enterprise-like workload. Expected backing: DC-class NVMe with
For my part, the tests should reveal complex, virtio-accelerated QEMU storage performance "in the wild" (if possible).
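That stress profile could be expressed as a fio job roughly like the following sketch; the target device path, runtime, and read-only choice are placeholders to be agreed:

```ini
; Sketch of the suggested stress profile (8 KiB random, QD64, 8 threads).
; Device path and runtime are placeholders, not an agreed configuration.
[global]
ioengine=libaio       ; windowsaio in Windows guests
direct=1
time_based=1
runtime=600
group_reporting=1

[randread-8k-qd64]
rw=randread           ; or randrw for a mixed-stress variant
bs=8k
iodepth=64
numjobs=8             ; 16 for the higher-thread variant
filename=/dev/vdb     ; virtio disk backed by DC-class NVMe
```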
-
On Wed, Nov 26, 2025 at 5:15 PM benyamin-codez ***@***.***> wrote:
> @peixiu <https://github.com/peixiu>
> You can use the original reproducer @iops-hunter <https://github.com/iops-hunter> linked to above, or use the config in my last comment above.
> It seems important to match the numjobs to the number of vCPUs.
> You should get some errors within 60s, certainly within 300s.

OK, thank you, I'll have a try with these tips.
Sorry I haven't had much time to test it completely yet; the reproduction is in progress.
I'll update the issue with any results.
Thanks~
Peixiu
-
Hi everyone,
I thought it prudent to kick off a discussion around performance testing for our storage stacks.
It would be helpful if we could develop a standard toolkit and configuration.
Perhaps we have already, and I missed the doco somewhere?
In terms of reporting, what would be the preferred format?
Raw, XML, JSON, HTML, PNG plots, CSV, XLSX, ODS, PDF, or some archived combination?
From what I can tell, and probably due to our Windows focus, we tend to mainly see IOMeter and DiskSpd being used, but reporting is limited and there doesn't appear to be a standardised test configuration. Historically, we also appear to have the Daynix StorMeter bricklet...
I do see some value in using fio due to its flexibility and its various reporting options and extensions, e.g. fio-plot.
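As a small illustration of why fio's reporting is attractive: it can emit machine-readable results that tools such as fio-plot consume. The job-file and output names below are placeholders:

```shell
# Emit machine-readable per-job results (file names are placeholders).
fio --output-format=json+ --output=randread-4k.json randread-4k.fio

# fio-plot can then render IOPS/latency graphs from a directory of such
# JSON results; see its documentation for the exact options.
```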
There may also be some common ground with the QEMU storage team if this is what they use?
Perhaps someone could ask them for their take on this?
Any other suggestions?
As for configuration, we should include a variety of block sizes, queue depths and thread counts.
What would everyone like to see included?
Are there other configuration elements we should include as standard, e.g. num_cpus, alignment, parallel threads?
Are we only concerned with read operations (after suitably random test files are written)?
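If we do settle on read-only measurement, a preconditioning pass that fills the test file with incompressible random data would need to be part of the standard config. A hedged fio sketch (file name and size are placeholders):

```ini
; Sketch of a preconditioning write pass run before read measurements.
; File name and size are placeholders, not an agreed configuration.
[precondition]
rw=write
bs=1m
iodepth=8
direct=1
refill_buffers=1      ; keep the written data random/incompressible
filename=D\:\testfile.dat
size=16g
```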
I would suggest at a minimum we should include (num_cpus = 4):
attn: @vrozenfe @YanVugenfirer @kostyanf14 @JonKohler @sb-ntnx @MartinCHarvey-Nutanix @MartinDrab