Sunday, February 23, 2025

The LZ4 Trade-Off (Illustrated and Simplified)

LZ4, gzip (a.k.a. zlib or DEFLATE), and increasingly Zstandard are the dominant choices for general-purpose, lossless compression. It's often stated that LZ4 is the most "CPU friendly" and thus fastest algorithm, but the trade-off is a lower compression ratio. On the other hand, gzip compresses better but is much slower. But what does "CPU friendly" mean and how does LZ4 compare to gzip and Zstandard?

Compression is an immensely complicated topic [1], but we can distill compression algorithms down to two primary techniques: 

  1. Eliminating repeated data
  2. Encoding data in a more compact format

In general, the first technique aligns well with modern CPUs while the second is more challenging. A "CPU-friendly" algorithm like LZ4 skips the second technique entirely in order to avoid multiple passes over the data and working with non-byte-aligned values. Instead, it performs only the first technique, making a single pass through the data and operating on byte-sized values that CPUs can process efficiently (especially with SIMD). It is this single-pass, byte-oriented approach that makes LZ4 very fast. Let's explore the trade-offs with some examples.

Consider the string "The quick brown fox jumps over the lazy dog. But that quick brown fox better be sure that the lazy dog is asleep". We can use Python to write this to a binary file:

string = "The quick brown fox jumps over the lazy dog. But that quick brown fox better be sure that the lazy dog is asleep"

with open("string_to_compress.bin", "wb") as f:
    f.write(string.encode('utf-8'))

We can print the contents of the generated file with `strings`:

$ strings string_to_compress.bin
The quick brown fox jumps over the lazy dog. But that quick brown fox better be sure that the lazy dog is asleep

We can see that there is some repeated text. For example, "quick brown fox" and "the lazy dog" appear twice. LZ4 will replace the second occurrence of each phrase with a (smaller) back reference to the first occurrence. We can observe this by using LZ4 to compress the data:

$ lz4 string_to_compress.bin
Compressed filename will be : string_to_compress.bin.lz4 
Compressed 112 bytes into 107 bytes ==> 95.54%  

If we dump the strings in the compressed file, we can clearly see where space was saved:

$ strings string_to_compress.bin.lz4 | tr '\n' ' '
&The quick brown fox jumps over the lazy dog. But that2 better be sure$  is asleep

In the compressed data, the second occurrences of "quick brown fox" and "the lazy dog" are indeed eliminated. But where LZ4 keeps it simple by swapping out repeats for pointers (or back references) to eliminate repeated data, algorithms like gzip and Zstandard take it further by adding the second general compression technique: encoding the data itself in a tighter format. Let's see how that works with a different kind of data.

Let's consider data that consists of only 16 distinct byte values and create a 4kB file with them.

import random

byte_array = bytes([
    0x01, 0x23, 0x45, 0x67,
    0x89, 0xab, 0xcd, 0xef,
    0xaa, 0x55, 0x00, 0xff,
    0xde, 0xad, 0xc0, 0xbe
])

num_bytes_to_write = 4096  # 4kB

with open("bytes_to_compress.bin", "wb") as f:
    for _ in range(num_bytes_to_write):
        byte = random.choice(byte_array)  # pick one of the 16 values at random
        f.write(bytes([byte]))

Our intuition suggests that since the data consists of only 16 different byte values repeated many times in a 4kB file, it should be highly compressible. But if we attempt to compress the resulting data with LZ4, no space is saved. Because the bytes are drawn at random, sequences long enough for LZ4 to replace with a back reference (the format's minimum match length is 4 bytes) almost never occur, so there is nothing for the first technique to eliminate.

$ lz4 bytes_to_compress.bin 
Compressed filename will be : bytes_to_compress.bin.lz4 
Compressed 4096 bytes into 4111 bytes ==> 100.37%    

On the other hand, compressing with gzip results in a file that is only 2347 bytes, a little more than half the original size. 

$ gzip -v bytes_to_compress.bin
bytes_to_compress.bin: 42.7% -- replaced with bytes_to_compress.bin.gz

The trick is that with only 16 distinct values, we do not need to represent each value with 8 bits. With only 16 choices, we should ideally need only log2(16) = 4 bits to uniquely identify all the possible values that could appear in the data. Let's use infgen [2] to analyze the contents of the gzip representation of the file. The infgen tool allows us to see (among other things) how many bits are being used to represent each value in our set of bytes.
 
litlen 0x00  4   # The value 0x00 is encoded in 4 bits
litlen 0x01  4   # The value 0x01 is encoded in 4 bits     
litlen 0x23  5   # The value 0x23 is encoded in 5 bits
litlen 0x45  5   # and so on... 
litlen 0x55  5
litlen 0x67  5
litlen 0x89  4
litlen 0xaa  5   
litlen 0xab  4
litlen 0xad  4
litlen 0xbe  4
litlen 0xc0  6
litlen 0xcd  5
litlen 0xde  4
litlen 0xef  5
litlen 0xff  5      

In the gzip representation of the data, each possible byte value is replaced with a shorter encoded value of between 4 and 6 bits. Not every value gets the ideal 4 bits: DEFLATE's Huffman code must remain prefix-free while also covering extra symbols such as the end-of-block marker, so some codes come out slightly longer. LZ4 simply skips this variable-length coding exercise altogether. In this extreme case, it doesn't take advantage of the data's compressibility at all.
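We can sanity-check this with a few lines of Python, comparing the ideal 4-bits-per-value size against what gzip actually achieves (a quick sketch; it assumes the bytes_to_compress.bin file generated above):

import gzip
import math

with open("bytes_to_compress.bin", "rb") as f:
    data = f.read()

# Ideal: 16 equally likely values need log2(16) = 4 bits each.
ideal_bytes = len(data) * math.log2(16) / 8  # 4096 * 4 / 8 = 2048 bytes

# What gzip actually produces for the same data.
actual_bytes = len(gzip.compress(data))

print(f"ideal: {ideal_bytes:.0f} bytes")
print(f"gzip : {actual_bytes} bytes")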

Zstandard is similar to gzip in that it includes both general compression techniques; however, it swaps out the algorithm used for variable-length encoding with one that is faster to execute on modern CPUs (among other fine tuning). One of the advantages of Zstandard is that it can be configured to trade off compression ratio for speed over a very wide range, from near-LZ4 levels of throughput to better-than-gzip levels of compression. It's as close to a "one stop shop" for software compression as one can get.
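To get a feel for that range, zstd's built-in benchmark mode can sweep compression levels over a sample file (throughput and ratios will vary by machine and input):

$ zstd -b1 -e19 bytes_to_compress.bin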

Of course, no compression done in software comes without a performance penalty. LZ4 attempts to minimize the impact and achieves very high throughput; however, there are other side effects like cache pollution that are not easily captured in simple throughput measurements. This is something to be explored in another blog post.

[1] "Understanding Compression" offers a great overview of compression that is perfect for software developers. (https://www.oreilly.com/library/view/understanding-compression/9781491961520/)

[2] https://github.com/madler/infgen

Tuesday, March 19, 2024

Three Things I Wish Every Storage Software Vendor Provided

In my work on SSDs, I have the opportunity to test a wide variety of storage software from parallel filesystems to high performance databases. SSD device vendors are always interested in demonstrating compatibility and (ideally) great performance at the application level. In my time working on "the other side of the wire," there are three things I wish every storage software vendor provided to benefit the relationship between SSDs, the application, and ultimately end users:

1. Create a storage-specific microbenchmark. This benchmark should use as much of the application's IO stack as possible and focus on the aspect(s) of performance most relevant to the application. It should measure a very limited number of metrics (or even a single key metric). The hardware requirements for the benchmark should be kept modest, using the minimum quantity of SSDs and servers. If your benchmark requires 6+ servers and multiple clients, few will want to run it more than once (if at all), let alone run it regularly. A well-written storage microbenchmark helps SSD vendors articulate the value of their products in metrics that are important to the application and helps end users validate the performance claims of an SSD device vendor without the complexity and cost of testing at full application scale. Clever storage software vendors will create scores and scoreboards to encourage SSD vendors to promote their results. Microbenchmarks can be put in the SQA flow as extended performance monitors to guard against application-specific performance regressions. One of the best examples of a well-executed storage microbenchmark is the Aerospike Certification Tool (ACT). It uses tail latency as its figure of merit.

2. Create a storage certification or compatibility program. This certification can be done in-house, provided as a self-certification to vendors, or enabled through a 3rd party. Each has its advantages. The ultimate goal is to help end users make device selections that improve their chance of first-pass success when deploying the application. Create a logo program and/or certification/compatibility list so that SSD vendors can promote that they work with the application. The certification can be in written form (i.e., a test plan), but is ideally partially or fully automated. The more that it is automated, the easier it is to run frequently or integrate into regular regression testing. VMware (like them or not these days) has a very complete certification program. Once passed, vendors get listed on a public certification site and gain access to a logo program (e.g., "VMware Ready"). Not all certifications must be nearly as complex to add significant value. A simple compatibility test goes a long way in ensuring first-pass customer success.

3. Create an SSD specification. The SSD specification should outline all the requirements needed to operate successfully in the application. Beyond hard requirements, an SSD specification can also provide guidance on what is desired in the future. This gives SSD device vendors something to consider when designing new products or prioritizing new feature development. The largest example of such a specification is the OCP Datacenter SSD Specification developed by Meta, Microsoft, and other large organizations (the latest version can be found in the OCP contribution database). While this specification is extremely large, even a couple thoughtful pages provided by a storage software provider helps SSD device vendors understand what is needed to operate successfully with the application. Requirements listed in an SSD specification can and should be validated by a certification or compatibility test.

TL;DR: My dream conversation with a storage software vendor would start with something like "here is my SSD specification, here is a microbenchmark you can run to see how you compare to other SSD devices, and here is a certification test you can run to get on our compatibility list."

Monday, July 10, 2023

The Weight of Data

Many of you have heard that SSDs become ever so slightly "heavier" as they're written because programming a cell adds electrons to its floating gate. Electrons do have mass, however small. Let's do some arithmetic to see what all that data weighs on a modern SSD.

We know an electron weighs 9.1093837 x 10^-28 grams. We just need to estimate how many electrons are contained in a full SSD. Let's do this per user TB. Straight away, we'll set aside the decimal-versus-binary conversion and start with 1TiB. This capacity does not count the spare area of NAND, so let's bump it by about 10% to 1.1TiB. Yes, these small adjustments are silly, but let's just have fun. Because data must be scrambled before being written to NAND, we should expect that a fully written SSD has about the same quantity of "0" and "1" data internally. Most NAND these days stores more than one bit per cell, so let's restate this as saying that all programming states are distributed equally across all cells. If the maximum program state requires 300 electrons (a guesstimate that I hope someone corrects), then we can estimate that cells contain 150 electrons on average. Since most SSDs use TLC NAND, that's 50 electrons per bit.

So, 1.1TiB is 8.8TiBit. Multiply by 50 electrons/bit and we have 440 "TiElectrons" per user TB, or 440 x 1024^4 electrons. So finally we get (440 x 1024^4) * (9.1093837 x 10^-28) = 4.40698 x 10^-13 grams in one TB of user data stored on a TLC SSD. Per Wikipedia, this is about the weight of a "prochlorococcus cyanobacteria, the smallest (and possibly most plentiful) photosynthetic organism on Earth." A Zettabyte of such data would have the equivalent weight of a mosquito. How cool is that?
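For anyone who wants to check the arithmetic or swap in a better electrons-per-state guess, the whole estimate fits in a few lines of Python (the 300-electron figure is the same guesstimate as above):

ELECTRON_GRAMS = 9.1093837e-28  # electron rest mass in grams

capacity_bits = 1.1 * (1024 ** 4) * 8  # 1.1 TiB of NAND, in bits
electrons_per_bit = (300 / 2) / 3      # 150 electrons per cell avg / 3 bits (TLC)

grams = capacity_bits * electrons_per_bit * ELECTRON_GRAMS
print(f"{grams:.5e} grams per user TB")  # ~4.40698e-13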

Thursday, May 4, 2023

Mixups with rwmixread

I saw an interesting question about FIO on Twitter. I will paraphrase here:

I observed that tests ran against a Ceph block device with FIO using the randrw parameter produce the exact same IOPS on read and write channels, but in other IO tests with the same drive, I saw the same amount of write IOPS but much higher read IOPS. Why?

The glib answer is "because that's what you told FIO to do." When using rw=randrw, the default mix of reads and writes is 50/50. FIO will dutifully enforce this ratio in terms of IOPS. Outside of FIO, there is nothing forcing a 50/50 mix of read and write IOPS, and the SSD is free to service more reads if it has the bandwidth to do so.

The core issue is that read and write performance of SSDs is not symmetrical, but the rwmixread parameter binds their performance together through IOPS. This tends to produce lower read IOPS than the SSD is otherwise capable of servicing. Consider a workload that is 70% read and 30% write, but the reads are 4KiB while writes are 256KiB. Running this against a Gen4 NVMe SSD, we get the following results:

read: IOPS=57.3k, BW=224MiB/s
write: IOPS=24.5k, BW=6122MiB/s  

The write throughput was saturated at 24.5k IOPS, and thus reads were limited to 57.3k IOPS to preserve the requested 70/30 IOPS ratio. If instead the write IOPS are fixed at 24.5k IOPS through the rate_iops option and the reads are allowed to be serviced as fast as possible, the results are much different:

read: IOPS=515k, BW=2011MiB/s
write: IOPS=24.5k, BW=6125MiB/s

Nearly 10x the number of read IOPS are achieved with the same write throughput. Of course, this is an extreme example using very different block sizes, but similar issues can arise even if reads and writes are the same block size, particularly in the presence of garbage collection.
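For reference, the uncoupled experiment can be expressed as two independent fio jobs, with only the writes rate-limited (a sketch, not the exact invocation used above; the device path, queue depths, and runtime are assumptions):

$ fio --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
      --time_based --runtime=60 \
      --name=writes --rw=randwrite --bs=256k --iodepth=32 --rate_iops=24500 \
      --name=reads --rw=randread --bs=4k --iodepth=32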

A more pernicious problem that can be hidden with rwmixread is how different SSD firmware prioritizes reads and writes. Some firmware implementations may favor reads over writes. When the rwmixread parameter is used, FIO must hold off submissions of IOs in one direction until the number of completions in the other direction meets the desired ratio. Thus a drive with read prioritization is forced to complete writes it may otherwise have deferred in the presence of further reads. Such a drive will behave very differently under different conditions.

My personal opinion is that the rwmixread parameter is over-used when evaluating SSDs. I prefer to separate reads and writes into their own threads and measure their interaction under different loads. Alas, this is a subject for another day.


Monday, April 10, 2023

NVMe & sysfs: a helper script

The Linux NVMe driver exports a really useful set of information to sysfs. While it's no replacement for nvme-cli's capabilities, there are a few nice things about using the information reported by sysfs to do basic inspection of the NVMe drives connected to a system. First, the information is available in user space, so superuser privileges are not required. Second, there are no dependencies to download. Lastly, you can get some information that's not available in nvme-cli, or that would require a set of nvme-cli commands to execute and parse.

I'm often adding drives to different systems and am interested in the drives' NUMA affinity and whether they negotiated the proper PCIe generation and link width. It's very easy to retrieve this information via sysfs and print it along with other useful information about the drives, such as model numbers, firmware revisions, and namespace configurations.
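To show where this information lives, here is a minimal sketch that walks the controllers under /sys/class/nvme (this is not the full function linked below; the attribute paths come from the mainline NVMe driver and PCI core):

for ctrl in /sys/class/nvme/nvme*; do
    echo "$(basename "$ctrl") ($(cat "$ctrl/address"))"
    echo "  Model    : $(cat "$ctrl/model")"
    echo "  Firmware : $(cat "$ctrl/firmware_rev")"
    echo "  NUMA Node: $(cat "$ctrl/device/numa_node")"
    echo "  Link     : $(cat "$ctrl/device/current_link_speed"), x$(cat "$ctrl/device/current_link_width")"
done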

I have a small Bash function that I use in my own custom .rc file. The output looks like this:

$ lsnvme

   Controller (PCI Address)    : nvme0 (0000:e6:00.0)
   Model                       : Dell DC NVMe PE8010 RI U.2 960GB
   Controller Type & Transport : reserved over pcie
   NUMA Node                   : 1
   Link Information            : 16.0 GT/s PCIe x4
   Serial Number               : SSA9N4572I1309E1N
   Firmware Revision           : 1.1.0
   Attached Namespace Count    : 1

      Namespace nvme0n1 (nsid = 1)
         Formatted / Physical Sector Size : 512 / 512
         Size in GB (512-Byte Sectors)    : 960 (1875385008)
         Globally Unique ID - NGUID       : NA
         Universally Unique ID - UUID     : NA
         Worldwide ID - WWID              : eui.ace42e00162bbe86

   Controller (PCI Address)    : nvme1 (0000:17:00.0)
   Model                       : CSD-3310
   Controller Type & Transport : io over pcie
   NUMA Node                   : 0
   Link Information            : 16.0 GT/s PCIe x4
   Serial Number               : UE2237C0787M
   Firmware Revision           : U3219141
   Attached Namespace Count    : 1

      Namespace nvme1n1 (nsid = 1)
         Formatted / Physical Sector Size : 512 / 4096
         Size in GB (512-Byte Sectors)    : 3840 (7501476528)
         Globally Unique ID - NGUID       : 55453232-3337-4330-4f55-49013738374d
         Universally Unique ID - UUID     : 55453232-3337-4330-4f55-49013738374d
         Worldwide ID - WWID              : eui.55453232333743304f5549013738374d


The dirt simple Bash function I use can be found here. I have an NVMe-centric way of looking at drives (controllers being very distinct entities from their attached namespaces) and often deal with multiple namespaces. The output format I prefer may not be optimal for someone who prefers a namespace-centric presentation of the same information. The main point here is that if you find yourself needing to work with NVMe drives frequently, it might be worth adding a sysfs-based NVMe inspection function to your shell that's suited to your needs.

Tuesday, April 4, 2023

Why SSDs "Lie" About Flush

A Flush command requests that an SSD controller write any data held in its volatile write cache to the NAND media. The intent is to ensure that all writes the controller has completed are actually persisted to the media. I sometimes hear people complain that SSDs "lie" about the Flush command, implying some nefarious, lazy, or outright bad firmware design. The truth is that Flush is quite a pernicious operation for NAND-based SSDs, and SSD designers are forced to balance between honoring Flush as it is intended and protecting the SSD from premature wear-out in the face of excessive Flush operations.

The root of the problem is the mismatch between host write units and the minimum write unit of the SSD, which is a NAND page (or larger, depending on the controller design). Hosts typically write to SSDs in units of 4kB while a NAND page today is commonly 16kB. An SSD controller would like to accumulate enough data from the host to write out at least a single page of data to NAND. In other words, the controller would like to accumulate 4 x 4kB writes in order to fill a 16kB NAND page. If the host has written only a single 4kB sector and then issues a Flush, the SSD has only two unsavory options: honor the Flush by padding the data to the NAND page size and writing it out to the media, or ignore the Flush since there is insufficient data to fill a NAND page.

Consider a host that would like to Flush on every 4kB write. If the SSD faithfully honors every Flush command, there is an immediate 4x write amplification penalty as only a quarter of each 16kB NAND page is used for host data. The SSD will also require garbage collection much earlier (at just 25% full), which compounds the write amplification penalty. In short, there is no way for an SSD to both honor Flush at all times and maintain its rated endurance in the presence of a Flush-happy host.
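The arithmetic is simple enough to play with directly (a toy model using the illustrative 4kB write and 16kB page sizes from above, ignoring garbage collection):

HOST_WRITE_KIB = 4
NAND_PAGE_KIB = 16

# Flush after every host write: each 4kB write is padded to a full page.
waf_flush_every_write = NAND_PAGE_KIB / HOST_WRITE_KIB
print(f"flush every write: {waf_flush_every_write:.0f}x write amplification")  # 4x

# Accumulate a full page of host data before programming: no padding.
writes_per_page = NAND_PAGE_KIB // HOST_WRITE_KIB
waf_batched = NAND_PAGE_KIB / (writes_per_page * HOST_WRITE_KIB)
print(f"fill pages first : {waf_batched:.0f}x write amplification")  # 1x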

"Enterprise" or "data center" drives solve the Flush problem by using on-board back-up capacitors to provide enough power to flush the contents of the controller's caches under all circumstances. This makes the controller caches effectively non-volatile and Flush can be safely treated as a no-op. Consumer drives don't have the luxury of adding the amount of capacitance needed to achieve the same guarantees. It adds cost and there is very limited room (or no room at all) on an M.2 PCB for the required capacitor placements. 

Whether enterprise or consumer grade, all SSDs complete writes when they arrive at the controller (not the NAND media). Waiting for the write to be committed to NAND would be a costly way to solve the Flush problem. Write latency would become dependent on host behavior. Ironically, slowly trickling writes would result in the highest latency. Imagine 4kB writes arriving 500 milliseconds apart. It would take 1.5 seconds to accumulate enough writes to fill a typical NAND page. This starts creating issues with command timeouts. Also, all writes would incur the 500+ microsecond page programming time, which could otherwise be pipelined away (at least until the NAND media bandwidth is saturated).  

So, how do we get clarity about Flush behavior? The answer from SSD vendors might be to buy a class of drive that supports back-up capacitors. From that perspective, it is a solved problem. There is certainly room for better cooperation between a host and an SSD without back-up capacitors to ensure that a Flush command is handled predictably, but that would require software changes. Perhaps the simplest solution is for consumer SSD vendors to offer different ways to handle Flush: one that is best effort and another that is an iron-clad guarantee, with the latter sacrificing on endurance warranty claims.

Then again, maybe it's simply ok to lie sometimes?
