Sunday, February 23, 2025

The LZ4 Trade-Off (Illustrated and Simplified)

LZ4, gzip (a.k.a. zlib or DEFLATE), and increasingly Zstandard are the dominant choices for general-purpose, lossless compression. It's often stated that LZ4 is the most "CPU friendly" and thus the fastest algorithm, but the trade-off is a lower compression ratio, while gzip compresses better but is much slower. But what does "CPU friendly" actually mean, and how does LZ4 compare to gzip and Zstandard?

Compression is an immensely complicated topic [1], but we can distill compression algorithms down to two primary techniques: 

  1. Eliminating repeated data
  2. Encoding data in a more compact format

In general, the first technique aligns well with modern CPUs while the second is more challenging. A "CPU-friendly" algorithm like LZ4 skips the second technique entirely in order to avoid multiple passes over the data or working with values that are not byte aligned. Instead, it performs only the first technique, making a single pass through the data and operating on byte-sized values that CPUs can process efficiently (especially with SIMD). It is this single-pass, byte-oriented approach that makes LZ4 very fast. The toy sketch below shows what the first technique amounts to; after that, let's explore the trade-offs with some examples.
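To be clear, this is not LZ4's actual format or match search (both are far more clever); it is a deliberately naive Python sketch of the core idea, a single pass that replaces repeats with back references. The names toy_compress and min_match are made up for illustration.

def toy_compress(data: bytes, min_match: int = 4):
    # Single pass over the bytes, emitting either a literal byte or a
    # ("ref", distance, length) token pointing back at data already seen.
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Naive search: check every earlier position for the longest match.
        for j in range(i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("ref", best_dist, best_len))  # cheap back reference
            i += best_len
        else:
            tokens.append(data[i])  # literal byte, copied through unchanged
            i += 1
    return tokens

Run on the sentence in the next example, this sketch would emit the second "quick brown fox" and "the lazy dog" as short ("ref", distance, length) tokens instead of literal bytes, which mirrors what the real lz4 tool does below.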

Consider the 112-byte string "The quick brown fox jumps over the lazy dog. But that quick brown fox better be sure that the lazy dog is asleep". We can use Python to write it to a binary file:

string = "The quick brown fox jumps over the lazy dog. But that quick brown fox better be sure that the lazy dog is asleep"
            
with open("string_to_compress.bin", "wb") as f:
        f.write(string.encode('utf-8'))

We can print the contents of the generated file with `strings`:

$ strings string_to_compress.bin
The quick brown fox jumps over the lazy dog. But that quick brown fox better be sure that the lazy dog is asleep

We can see that there is some repeated text. For example, "quick brown fox" and "the lazy dog" each appear twice. LZ4 will replace the second occurrence of each repeated phrase with a (smaller) reference back to the first occurrence. We can observe this by compressing the data with LZ4:

$ lz4 string_to_compress.bin
Compressed filename will be : string_to_compress.bin.lz4 
Compressed 112 bytes into 107 bytes ==> 95.54%  
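The same result can be reproduced programmatically. Here is a minimal sketch, assuming the third-party python-lz4 package is installed (pip install lz4); the exact compressed size may differ slightly from the command-line tool because the frame options differ:

import lz4.frame  # third-party python-lz4 package (assumed installed)

with open("string_to_compress.bin", "rb") as f:
    original = f.read()

compressed = lz4.frame.compress(original)
print(len(original), "->", len(compressed), "bytes")

# LZ4 is lossless: decompression recovers exactly the original bytes.
assert lz4.frame.decompress(compressed) == original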

If we dump the strings in the compressed file, we can clearly see where space was saved:

$ strings string_to_compress.bin.lz4 | tr '\n' ' '
&The quick brown fox jumps over the lazy dog. But that2 better be sure$  is asleep

In the compressed data, the second occurrences of "quick brown fox" and "the lazy dog" are indeed eliminated. But where LZ4 keeps it simple, only swapping repeats for pointers (or back references), algorithms like gzip and Zstandard go further and apply the second general compression technique as well: re-encoding the data itself in a tighter format. Let's see how that works with a different kind of data.

Let's consider data that consists of only 16 distinct byte values and create a 4kB file by picking from them at random.

import random

# The 16 distinct byte values to choose from
byte_array = bytes([
    0x01, 0x23, 0x45, 0x67,
    0x89, 0xab, 0xcd, 0xef,
    0xaa, 0x55, 0x00, 0xff,
    0xde, 0xad, 0xc0, 0xbe
])

num_bytes_to_write = 4096  # 4kB

with open("bytes_to_compress.bin", "wb") as f:
    for _ in range(num_bytes_to_write):
        byte = random.choice(byte_array)
        f.write(bytes([byte]))

Our intuition suggests that since the data consists of only 16 different byte values repeated many times in a 4kB file, it should be highly compressible. But because the values appear in random order, there are few long repeated sequences for LZ4 to reference, and attempting to compress the file with LZ4 saves no space at all:

$ lz4 bytes_to_compress.bin 
Compressed filename will be : bytes_to_compress.bin.lz4 
Compressed 4096 bytes into 4111 bytes ==> 100.37%    

On the other hand, compressing with gzip results in a file that is only 2347 bytes, a little more than half the original size. 

$ gzip -v bytes_to_compress.bin
bytes_to_compress.bin: 42.7% -- replaced with bytes_to_compress.bin.gz

The trick is that with only 16 distinct values, we do not need to represent each value with 8 bits. With only 16 choices, we should ideally need only log2(16) = 4 bits to uniquely identify every value that can appear in the data. Let's use infgen [2] to analyze the contents of the gzip representation of the file. The infgen tool allows us to see (among other things) how many bits are used to represent each value in our set of bytes:
 
litlen 0x00  4   # The value 0x00 is encoded in 4 bits
litlen 0x01  4   # The value 0x01 is encoded in 4 bits     
litlen 0x23  5   # The value 0x23 is encoded in 5 bits
litlen 0x45  5   # and so on... 
litlen 0x55  5
litlen 0x67  5
litlen 0x89  4
litlen 0xaa  5   
litlen 0xab  4
litlen 0xad  4
litlen 0xbe  4
litlen 0xc0  6
litlen 0xcd  5
litlen 0xde  4
litlen 0xef  5
litlen 0xff  5      

In the gzip representation of the data, each possible byte value is replaced with a shorter encoded value of between 4 and 6 bits. The result is not quite the ideal 4 bits because the same code space also has to cover symbols added by the earlier repeat-elimination phase of the algorithm (plus an end-of-block marker). LZ4 simply skips this "variable-length coding" exercise altogether, and in this extreme case it fails to take advantage of the data's compressibility at all.
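As a quick sanity check on that "ideal" number, here is a small sketch that computes the Shannon entropy of the generated file. For 16 roughly equally likely values it should come out near 4 bits per byte, or about 2048 bytes for the 4kB file, which brackets gzip's 2347-byte result from below:

import math
from collections import Counter

with open("bytes_to_compress.bin", "rb") as f:
    data = f.read()

# Shannon entropy in bits per byte: the lower bound for encoding each
# byte independently, ignoring any repetition between bytes.
counts = Counter(data)
entropy = -sum((n / len(data)) * math.log2(n / len(data)) for n in counts.values())

print(f"entropy: {entropy:.2f} bits/byte")                 # ~4.0
print(f"ideal size: {entropy * len(data) / 8:.0f} bytes")  # ~2048 for 4096 input bytes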

Zstandard is similar to gzip in that it includes both general compression techniques; however, it swaps out the algorithm used for variable-length encoding with one that is faster to execute on modern CPUs (among other fine-tuning). One of the advantages of Zstandard is that it can be configured to trade off compression ratio for improved speed over a very wide range: from near-LZ4 levels of throughput to better-than-gzip levels of compression. It's as close to a "one stop shop" for software compression as one can get.
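To illustrate that range, here is a minimal sketch using the third-party python-zstandard package (pip install zstandard, an assumption). Lower levels favor speed and higher levels favor ratio; the exact sizes will vary:

import zstandard as zstd  # third-party python-zstandard package (assumed installed)

with open("bytes_to_compress.bin", "rb") as f:
    data = f.read()

# Compress the same data at a fast, a default, and a high compression level.
for level in (1, 3, 19):
    compressed = zstd.ZstdCompressor(level=level).compress(data)
    print(f"level {level:2d}: {len(data)} -> {len(compressed)} bytes")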

Of course, no compression done in software comes without a performance penalty. LZ4 attempts to minimize the impact and achieves very high throughput; however, there are other side effects like cache pollution that are not easily captured in simple throughput measurements. This is something to be explored in another blog post.

[1] "Understanding Compression" offers a great overview of compression that is perfect for software developers. (https://www.oreilly.com/library/view/understanding-compression/9781491961520/)

[2] https://github.com/madler/infgen
