Thursday, May 4, 2023

Mixups with rwmixread

I saw an interesting question about FIO on Twitter. I will paraphrase here:

I observed that tests run against a Ceph block device with FIO using the randrw parameter produce the exact same IOPS on the read and write channels, but in other IO tests against the same drive, I saw the same write IOPS but much higher read IOPS. Why?

The glib answer is "because that's what you told FIO to do." When using rw=randrw, the default mix of reads and writes is 50/50, and FIO will dutifully enforce this ratio in terms of IOPS. Outside of FIO, nothing forces a 50/50 mix of read and write IOPS, and the SSD is free to service more reads if it has the bandwidth to do so.
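
For reference, even a job file as minimal as this one locks reads and writes to equal IOPS, because rwmixread defaults to 50. The device path, IO engine, and queue depth here are placeholders I chose for illustration, not details from the original question:

[randrw-default]
filename=/dev/nvme0n1
ioengine=libaio
direct=1
iodepth=32
rw=randrw
# rwmixread defaults to 50: equal read and write IOPS
bs=4k
runtime=60
time_based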

The core issue is that read and write performance of SSDs is not symmetrical, but the rwmixread parameter binds them together through IOPS. This tends to produce lower read IOPS than the SSD is otherwise capable of servicing. Consider a workload that is 70% read and 30% write, where the reads are 4KiB and the writes are 256KiB.
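
A job file along these lines expresses that mix; as before, the device path, IO engine, and queue depth are assumptions rather than details of the actual test:

[mixed]
filename=/dev/nvme0n1
ioengine=libaio
direct=1
iodepth=32
rw=randrw
# 70% of issued IOPS are reads
rwmixread=70
# per-direction block sizes: 4KiB reads, 256KiB writes
bs=4k,256k
runtime=60
time_based

Running this against a Gen4 NVMe SSD, we get the following results: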

read: IOPS=57.3k, BW=224MiB/s
write: IOPS=24.5k, BW=6122MiB/s  

Writes saturated at 24.5k IOPS, and reads were therefore limited to 57.3k IOPS to preserve the requested 70/30 IOPS ratio. If instead the write IOPS are pinned at 24.5k through the rate_iops option and the reads are allowed to be serviced as fast as possible, the ratio no longer holds the reads back.
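
One way to set this up is to split reads and writes into separate jobs that FIO runs concurrently, capping only the writer with rate_iops. The path, engine, and queue depths are again assumptions:

[global]
filename=/dev/nvme0n1
ioengine=libaio
direct=1
runtime=60
time_based

[reads]
rw=randread
bs=4k
iodepth=32

[writes]
rw=randwrite
bs=256k
iodepth=32
# hold writes at the 24.5k IOPS the drive sustained above
rate_iops=24500

The results are much different: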

read: IOPS=515k, BW=2011MiB/s
write: IOPS=24.5k, BW=6125MiB/s

Roughly 9x the read IOPS (515k vs. 57.3k) are achieved with the same write throughput. Of course, this is an extreme example using very different block sizes, but similar issues can arise even when reads and writes use the same block size, particularly in the presence of garbage collection.

A more pernicious problem that rwmixread can hide is how different SSD firmware prioritizes reads and writes. Some firmware implementations favor reads over writes. When the rwmixread parameter is used, FIO must hold off submitting IOs in one direction until the completions in the other direction catch up to the desired ratio. A drive that prioritizes reads is thus forced to complete writes it would otherwise have deferred in the presence of further reads, and it will behave very differently under this artificial pacing than it would under an unconstrained workload.

My personal opinion is that the rwmixread parameter is over-used when evaluating SSDs. I prefer to separate reads and writes into their own threads and measure their interaction under different loads. Alas, this is a subject for another day.

  

Three Things I Wish Every Storage Software Vendor Provided

In my work on SSDs, I have the opportunity to test a wide variety of storage software from parallel filesystems to high performance database...