I have been studying LZW compression and there is one thing that I can't satisfy myself with which is that while building up the dictionary in LZW mostly its maximum limit is set to 4096 entries. Why is that ?. Also if dictionary gets full then the dictionary is reset but what if the next few characters about to read were present in the dictionary before resetting the dictionary. Is this a limitation ? or my understanding is not correct?
The dictionary size is limited to the output symbol size. 12 bits can encode 4096 distinct values. It is a common choice for lecture notes and simple/assignment implementations.
However, ANY symbol bits more than the source bits can be used: a 16 bit symbol would allow 65k dictionary entries. The more bits, the more entries in the current dictionary can exist which can increase the “maximum” compression. Conversely, as each output symbol is larger it may decrease compression rates, especially for smaller inputs (insufficient time to generate a dictionary) and data that is more random (reduced ability to re-use the symbols in the dictionary). In practice, 19-20 bits seems to be about the useful limit2, while 16 bit symbols naturally align to bytes.
It is also possible to have an adaptive symbol size based on log2 of the CURRENT number of mapped symbols1- but this benefit disappears as data size increases, as the dictionary quickly fills. It is also largely superseded by Huffman coding.
When the dictionary is "reset", it is effectively the same as compressing multiple chunks of data and appending the compressed output: the dictionaries are separate. However, the data can be “split” dynamically based on when it fills the dictionary as opposed to, say, every X bytes of input. Since the symbol size is fixed, it is more efficient to ensure the dictionary is filled up before making the decision.
The primary purpose to reset a dictionary is to avoid “fixating” the symbols to data characteristics in one part of the input that might not be true for later data. A compressor can use a single non-resetting dictionary, reset the dictionary as soon as it is full, reset the dictionary when it’s full and a drop in compression is encountered, etc.: the goal is to achieve the highest compression within the domain/parameters.
Many LZ77/LZ78/LZW variations (and optimizations they utilize) are briefly discussed in "On parsing optimality for dictionary-based text compression—the Zip case" by AlessioLangiu; these excerpts contain lots of juicy details to further additional research.
1"Improving LZW' by R. Nigel Horspool goes into some details on adaptive symbol sizes. Nigel's "The Effect of Non-Greedy Parsing
in Ziv-Lempel Compression Methods" paper also includes a summary on compress's handling of dictionary resets.
2"The Relative Efficiency of Data Compression by LZW and LZSS" by Yair Wiseman includes a sample graph of symbols sizes vs. compression efficiency. The graph is highly dependent on the data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With