Wednesday, April 20, 2011

What is the best compression algorithm for small 4kb files?

I am trying to compress TCP packets, each about 4 KB in size. The packets can contain any byte value (from 0 to 255). All of the compression benchmarks that I found were based on larger files. I did not find anything that compares the compression ratios of different algorithms on small files, which is what I need. It has to be open source so that it can be implemented in C++, so no RAR, for example. What algorithm would you recommend for small files of about 4 kilobytes in size? lzma? hacc? zip? gzip2?

From stackoverflow
  • All of those algorithms are reasonable to try. As you say, they aren't optimized for tiny files, but your next step is simply to try them. It will likely take only 10 minutes to test-compress some typical packets and see what sizes result (try different compression flags too). From the resulting files you can likely pick out which tool works best.

    The candidates you listed are all good first tries. You might also try bzip2.

    Sometimes a simple "try them all" is a good solution when the tests are easy to do. Thinking too much sometimes slows you down.

    Blorgbeard : I agree, and ask that you post your results when you're done :)
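
One way to run that quick test is sketched below in Python, chosen only because zlib, bz2 and lzma ship in its standard library. The `sample_packet` function is a hypothetical stand-in for real traffic, and the compression levels are arbitrary choices; swap in captured payloads before drawing any conclusions.

```python
import bz2
import lzma
import time
import zlib


def sample_packet(size=4096):
    """Hypothetical stand-in for a real payload: repetitive, text-like bytes."""
    line = b"GET /api/v1/items?id=%d HTTP/1.1\r\nHost: example.com\r\n\r\n"
    out = bytearray()
    i = 0
    while len(out) < size:
        out += line % i
        i += 1
    return bytes(out[:size])


def compare(packet):
    """Return {codec name: (compressed size in bytes, seconds taken)}."""
    codecs = {
        "zlib (DEFLATE)": lambda d: zlib.compress(d, 9),
        "bz2": lambda d: bz2.compress(d, 9),
        "lzma (xz container)": lambda d: lzma.compress(d),
    }
    results = {}
    for name, fn in codecs.items():
        start = time.perf_counter()
        size = len(fn(packet))
        results[name] = (size, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    packet = sample_packet()
    for name, (size, secs) in compare(packet).items():
        pct = 100 * size / len(packet)
        print(f"{name:20s} {size:5d} bytes  {pct:6.2f}%  {secs * 1e3:.3f} ms")
```

Each packet is compressed on its own here, which matches the situation where every packet must be decodable independently.
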
  • For small packets, the biggest gain comes from Huffman-like entropy coding, since the most frequently used byte values automatically take the least space. If you combine it with dictionary-based compression (LZ variants), you will get very decent compression.
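
To get a feel for how much a Huffman-like coder could save on a given packet, one can measure the Shannon entropy of its byte histogram. This is only a sketch: the function names are made up for illustration, and the bound ignores the cost of sending the code table, which matters a lot at 4 KB.

```python
import math
from collections import Counter


def order0_entropy_bits(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def order0_lower_bound(data: bytes) -> int:
    """Rough lower bound, in bytes, for a coder that only exploits byte
    frequencies. The code table itself is not counted."""
    return math.ceil(order0_entropy_bits(data) * len(data) / 8)


if __name__ == "__main__":
    skewed = b"a" * 3000 + b"b" * 1000 + bytes(range(96))
    print(f"{order0_entropy_bits(skewed):.3f} bits/byte")
    print(f"{order0_lower_bound(skewed)} bytes at best, from {len(skewed)}")
```

A result close to 8 bits per byte means byte frequencies alone will not help, and any gain has to come from repeated sequences instead.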

  • I don't think the file size matters - if I remember correctly, the LZW in GIF resets its dictionary every 4K.

  • ZLIB should be fine. It is used in MCCP (the MUD Client Compression Protocol).

    However, if you really need good compression, I would do an analysis of common patterns and include a dictionary of them in the client, which can yield even higher levels of compression.
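
zlib supports this idea directly through a preset dictionary. The sketch below uses Python's zlib binding; the dictionary contents and the JSON-like payload are invented for illustration, and both sides must hold exactly the same dictionary bytes.

```python
import zlib

# Hypothetical dictionary: byte strings expected to recur in the traffic.
# zlib gives shorter codes to matches near the end of the dictionary,
# so the most common strings go last.
SHARED_DICT = (
    b'"status":"error","status":"ok","user_id":"timestamp":{"type":"update",'
)


def compress_packet(payload: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    """Compress one packet on its own, primed with the shared dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(payload) + comp.flush()


def decompress_packet(blob: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    """Inverse of compress_packet; needs the same dictionary bytes."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    payload = b'{"type":"update","user_id":42,"status":"ok","timestamp":1303257600}'
    plain = zlib.compress(payload, 9)
    primed = compress_packet(payload)
    print(len(payload), len(plain), len(primed))
```

The dictionary is never transmitted, so its benefit is largest on short packets where the compressor would otherwise have no history to refer to.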

  • I did what Arno Setagaya suggested in his answer: made some sample tests and compared the results.

    The compression tests were done using 5 files, each of them 4096 bytes in size. Each byte inside of these 5 files was generated randomly.

    IMPORTANT: In real life, the data would not likely be all random, but would tend to have quite a few repeating bytes. Thus in a real-life application the compression would tend to be a bit better than the following results.

    NOTE: Each of the 5 files was compressed by itself (i.e. not together with the other 4 files, which would result in better compression). In the following results I just use the sum of the size of the 5 files together for simplicity.

    I included RAR just for comparison reasons, even though it is not open source.

    Results: (from best to worst)

    LZOP: 20775 / 20480 * 100 = 101.44% of original size

    RAR : 20825 / 20480 * 100 = 101.68% of original size

    LZMA: 20827 / 20480 * 100 = 101.69% of original size

    ZIP : 21020 / 20480 * 100 = 102.64% of original size

    BZIP: 22899 / 20480 * 100 = 111.81% of original size

    Conclusion: To my surprise ALL of the tested algorithms produced output larger than the originals!!! I guess they are only good for compressing larger files, or files that have a lot of repeating bytes (not random data like the above). Thus I will not be using any type of compression on my TCP packets. Maybe this information will be useful to others who consider compressing small pieces of data.

    EDIT: I forgot to mention that I used default options (flags) for each of the algorithms.

    kquinn : Your test is pretty worthless. Just about *any* compression algorithm will choke on random data -- in fact, compression ratio is a useful test for *how random* a chunk of data is -- if "compressing" enlarges data, it's probably high-entropy. Try again with real data and you might get useful results.
    Rick C. Petty : I agree that the test is worthless. Randomly-distributed data will not compress, in fact the basis of most compression algorithms is that the data is not random. Also, your comparison does not include zlib which only adds 5 bytes every 64k when STORE is used instead of DEFLATE.
    derobert : Compression is not magic. It works by observing repeating patterns. Random data has no repeating patterns, and will thus not compress. It can not, as 256^4096 > 256^4095: there are more possible 4096-byte inputs than there are 4095-byte outputs.
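
The point made in these comments is easy to reproduce. The sketch below compresses a 4096-byte random buffer and a 4096-byte structured buffer with zlib; the log-like line is a hypothetical example of non-random data, not anything taken from the original test.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed form at the highest level."""
    return len(zlib.compress(data, 9))


rng = random.Random(2011)  # fixed seed so the run is repeatable
random_packet = bytes(rng.getrandbits(8) for _ in range(4096))

# Hypothetical "realistic" payload: a log-like line repeated to fill 4096 bytes.
line = b"2011-04-20 12:00:00 INFO connection accepted from 10.0.0.1\n"
structured_packet = (line * (4096 // len(line) + 1))[:4096]

if __name__ == "__main__":
    print("random:    ", compressed_size(random_packet), "bytes")
    print("structured:", compressed_size(structured_packet), "bytes")
```

The random buffer comes out slightly larger than 4096 bytes, while the structured one shrinks to a small fraction of its size, so a benchmark on random bytes says little about real packets.
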
  • I've had luck using the zlib compression library directly, without any file container. ZIP and RAR have overhead to store things like filenames. I've seen compression this way yield positive results (compressed size smaller than the original) for packets down to 200 bytes.
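
A small sketch of how much the framing alone costs, using Python's zlib binding. The `wbits` argument selects the wrapper around the same DEFLATE data: a negative value gives a raw stream, 15 gives the zlib format (2-byte header plus 4-byte Adler-32), and 31 gives the gzip format (at least a 10-byte header plus an 8-byte trailer). The 200-byte payload is invented for illustration.

```python
import zlib


def deflate_with(wbits: int, payload: bytes) -> bytes:
    """Compress payload with DEFLATE inside the wrapper selected by wbits."""
    comp = zlib.compressobj(level=9, wbits=wbits)
    return comp.compress(payload) + comp.flush()


def framing_sizes(payload: bytes) -> dict:
    """Compressed size for each wrapper around the same DEFLATE stream."""
    return {
        "raw deflate": len(deflate_with(-15, payload)),
        "zlib": len(deflate_with(15, payload)),
        "gzip": len(deflate_with(31, payload)),
    }


if __name__ == "__main__":
    payload = (b"user=alice;action=move;x=10;y=20;" * 7)[:200]
    for name, size in framing_sizes(payload).items():
        print(f"{name:12s} {size:4d} bytes (input {len(payload)})")
```

On a 200-byte packet those few bytes of wrapper are a noticeable share of the output, which is why a full archive container with filenames and directory records is a poor fit.
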

  • Here are some questions to ponder!!!

    Are you transmitting the dictionary within each packet?

    How big is the dictionary for each file?

    Is the data really random?

    I would suggest you analyze your data and build a static dictionary that the receiving end 'knows', or that can at least be updated occasionally.

    This will save considerable space and transmission time. There is no reason why this dictionary can't be huge compared to your packet size of 4K, e.g. 32MB.

    What's the best way to transfer the contents of the phone book from A to B? Tell B to pick his copy and use it!!!

  • Choose the algorithm that is the quickest, since you probably care about doing this in real time. Generally for smaller blocks of data, the algorithms compress about the same (give or take a few bytes) mostly because the algorithms need to transmit the dictionary or Huffman trees in addition to the payload.

    I highly recommend Deflate (used by zlib and Zip) for a number of reasons. The algorithm is quite fast and well tested, zlib's implementation is under a permissive BSD-style license (the zlib license), and Deflate is the only compression method that Zip implementations are required to support (as per the Zip APPNOTE).

    Beyond the basics, when the compressed output would be larger than the input, there is a STORE mode which only adds 5 bytes for every block of data (the maximum block is 64k bytes). Apart from STORE mode, Deflate supports two different types of Huffman tables: dynamic and fixed. A dynamic table means the Huffman tree is transmitted as part of the compressed data; it is the most flexible option (for varying types of nonrandom data). The advantage of a fixed table is that the table is known by all decoders and thus doesn't need to be contained in the compressed stream.

    The decompression (or Inflate) code is relatively easy. I've written both Java and JavaScript versions based directly on zlib, and they perform rather well.
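
The three block types described above can be compared with Python's zlib binding. This is a sketch under a few assumptions: level 0 is used to force stored blocks, `zlib.Z_FIXED` (available when the underlying zlib is recent enough) restricts the encoder to the fixed Huffman table, and the default strategy lets the encoder pick per block. Raw streams (`wbits=-15`) are used so that only block framing is counted.

```python
import random
import zlib


def raw_deflate(payload: bytes, level: int,
                strategy: int = zlib.Z_DEFAULT_STRATEGY) -> bytes:
    """Raw DEFLATE stream (no zlib or gzip wrapper) for one payload."""
    comp = zlib.compressobj(level=level, wbits=-15, strategy=strategy)
    return comp.compress(payload) + comp.flush()


def block_mode_sizes(payload: bytes) -> dict:
    """Output size when the encoder is limited to each kind of block."""
    return {
        "stored (level 0)": len(raw_deflate(payload, 0)),
        "fixed Huffman only": len(raw_deflate(payload, 9, zlib.Z_FIXED)),
        "encoder's choice": len(raw_deflate(payload, 9)),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    noise = bytes(rng.getrandbits(8) for _ in range(4096))
    text = (b"status=ok;" * 410)[:4096]
    for label, data in (("random", noise), ("repetitive", text)):
        print(label, block_mode_sizes(data))
```

On incompressible input the encoder's own choice should land within a few bytes of the stored size, which is the worst-case behaviour the answer relies on.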

    The other compression algorithms mentioned have their merits. I prefer Deflate because of its runtime performance in both the compression step and, particularly, the decompression step.

    A point of clarification: Zip is not a compression type, it is a container. For doing packet compression, I would bypass Zip and just use the deflate/inflate APIs provided by zlib.

  • You can try delta compression (http://en.wikipedia.org/wiki/Delta_encoding). How well it compresses will depend on your data. If you have any encapsulation on the payload, then you can compress the headers.
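
One simple reading of delta encoding is to store the difference between each byte and the previous one before handing the result to a general-purpose compressor. The sketch below assumes that reading (deltas within a single packet, modulo 256); deltas between consecutive packets or between header fields would follow the same pattern. The ramp data is a hypothetical example of slowly changing values.

```python
import zlib


def delta_encode(data: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte (mod 256)."""
    prev = 0
    out = bytearray()
    for b in data:
        out.append((b - prev) & 0xFF)
        prev = b
    return bytes(out)


def delta_decode(deltas: bytes) -> bytes:
    """Inverse of delta_encode: running sum of the deltas (mod 256)."""
    prev = 0
    out = bytearray()
    for d in deltas:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)


if __name__ == "__main__":
    ramp = bytes(i % 256 for i in range(4096))  # steadily increasing values
    print("plain:", len(zlib.compress(ramp, 9)))
    print("delta:", len(zlib.compress(delta_encode(ramp), 9)))
```

Whether this helps depends entirely on the data: it pays off for counters, timestamps and sampled signals, and can make things worse for text.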

  • If you want to "compress TCP packets", you might consider using an RFC-standard technique.

    • RFC2394 IP Payload Compression Using DEFLATE
    • RFC2395 IP Payload Compression Using LZS
    • RFC3173 IP Payload Compression Protocol (IPComp)
    • RFC3051 IP Payload Compression Using ITU-T V.44 Packet Method
    • RFC5172 Negotiation for IPv6 Datagram Compression Using IPv6 Control Protocol
    • RFC5112 The Presence-Specific Static Dictionary for Signaling Compression (Sigcomp)
    • RFC3284 The VCDIFF Generic Differencing and Compression Data Format
    • RFC2118 Microsoft Point-To-Point Compression (MPPC) Protocol

    There are probably other relevant RFCs I've overlooked.
