LZ4, gzip (a.k.a. zlib or DEFLATE), and increasingly Zstandard are the dominant choices for general-purpose, lossless compression. It's often stated that LZ4 is the most "CPU friendly" and thus fastest algorithm, but the trade-off is a lower compression ratio. On the other hand, gzip compresses better but is much slower. But what does "CPU friendly" mean and how does LZ4 compare to gzip and Zstandard?
Compression is an immensely complicated topic [1], but we can distill compression algorithms down to two primary techniques:
- Eliminating repeated data
- Encoding data in a more compact format
In general, the first technique aligns well to modern CPUs while the second is more challenging. A "CPU-friendly" algorithm like LZ4 skips the second compression technique in order to avoid multiple passes over the data or working with non-byte aligned values. Instead, it only performs the first technique, working over a single pass through the data on byte-sized values that CPUs can process efficiently (especially with SIMD). It is this single-pass, byte oriented approach that makes LZ4 very fast. Let's explore the trade-offs with some examples.
Consider the string "The quick brown fox jumps over the lazy dog. But that quick brown fox better be sure that the lazy dog is asleep." We can use Python to write this to a binary file:
with open("string_to_compress.bin", "wb") as f:
f.write(string.encode('utf-8'))
The quick brown fox jumps over the lazy dog. But that quick brown fox better be sure that the lazy dog is asleep
We can see that there is some repeated text. For example, "quick brown fox" and "the lazy dog" appear twice. LZ4 will replace the second occurrence of this text with a (smaller) reference to the first occurrence of the text. And we can observe this by using LZ4 to compress the data:
Compressed filename will be : string_to_compress.bin.lz4
Compressed 112 bytes into 107 bytes ==> 95.54%
If we dump the strings in the compressed file, we can clearly see where space was saved:
&The quick brown fox jumps over the lazy dog. But that2 better be sure$ is asleep
In the compressed data, the second occurrences of "quick brown fox" and "the lazy dog" are indeed eliminated. But where LZ4 keeps it simple by swapping out repeats for pointers (or back references) to eliminate repeated data, algorithms like gzip and Zstandard take it further by adding the second general compression technique of optimizing how the data itself is written, squeezing it into a tighter format. Let’s see how that works with a different kind of data.
Let's consider data that only consists of 16 distinct byte values and create a 4kB file with them.
In the gzip representation of the data, each possible byte value is replaced with a shorter encoded value of between 4 and 6 bits (but not quite the ideal value of 4 due to overhead added in earlier phases of the compression algorithm). LZ4 simply skips this "variable length coding" exercise altogether. In this extreme case, it doesn't take advantage of the data compressibility at all.
Zstandard is similar to gzip in that it includes both general compression techniques; however, it swaps out the algorithm used for variable length encoding with one that is faster to execute on modern CPUs (among other fine tuning). One of the advantages of Zstandard is that it can be configured to trade-off compression ratio for improved speed over a very wide range; from near-LZ4 levels of throughput to better-than-gzip levels of compression. It's as close to a "one stop shop" for software compression as one can get.
Of course, no compression done in software comes without a performance penalty. LZ4 attempts to minimize the impact and achieves very high throughput; however, there are other side effects like cache pollution that are not easily captured in simple throughput measurements. This is something to be explored in another blog post.
[1] "Understanding Compression" offers a great overview of compression that is perfect for software developers. (https://www.oreilly.com/library/view/understanding-compression/9781491961520/)
No comments:
Post a Comment