I'm using the DotNetZip library to compress a data stream into a zip file for storage. DotNetZip can compress using multiple threads, and it's nice and fast.
All of the libraries I've found, however, are single-threaded for decompression.
Is this a shortcoming of the ZIP format in general? Is there a multi-threaded unzip function in the .NET world, ideally with a Stream interface?
If not, are there technical reasons why this can't be implemented?
Additional info: the data being compressed is SQL Server database backups, roughly 30 GB in size, streamed from a SQL Server BACKUP command (via VDI) through a ZipOutputStream to a FileStream.
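Roughly, the write side looks like this. This is a minimal sketch assuming DotNetZip's `Ionic.Zip.ZipOutputStream` API; the VDI plumbing is omitted and `sourceStream` stands in for the stream delivered by the BACKUP command:

```csharp
using System.IO;
using Ionic.Zip;

class BackupZipWriter
{
    // Minimal sketch: sourceStream stands in for the data coming from the
    // SQL Server BACKUP command over VDI; the VDI plumbing itself is omitted.
    public static void Compress(Stream sourceStream, string zipPath)
    {
        using (var fileStream = File.Create(zipPath))
        using (var zip = new ZipOutputStream(fileStream))
        {
            // threshold above which DotNetZip uses its multi-threaded deflate
            // (512 KB is also the library default)
            zip.ParallelDeflateThreshold = 512 * 1024;
            zip.PutNextEntry("database.bak");

            var buffer = new byte[81920];
            int read;
            while ((read = sourceStream.Read(buffer, 0, buffer.Length)) > 0)
                zip.Write(buffer, 0, read);
        }
    }
}
```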
It's not a technical impossibility.
DotNetZip doesn't do multi-threaded decompression simply because I never implemented it: MT compression was the priority, and I did that, but I haven't gotten around to MT decompression. Compression is generally a more CPU-intensive and expensive operation than decompression; this is particularly true of DEFLATE, the compression algorithm typically used in ZIP archives, because of the search it performs over prior data to find matches. Though I am not a compression-algorithm expert, I'd guess that a similar characteristic applies to other compression algorithms. There's no search during decompression, so decompression is generally much faster. For that reason, optimizing decompression in DotNetZip was less of a priority.
A side note: the parallel compression in DotNetZip is done within a single file. Suppose you have a file of 1000 blocks (for some arbitrary block length). DotNetZip will enlist multiple threads in compression, each thread compressing one block. Because the compressor threads run independently, it's possible that the compression for block 6 will finish before the compression for block 4, for instance. The main thread is therefore responsible for re-assembling the compressed blocks back into the proper order and then writing them to the output stream.
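To make that re-ordering concrete, here is a simplified, hypothetical sketch of the pattern, not DotNetZip's actual code: each block is deflated independently on the thread pool, and the main thread writes the results back in their original order. Note that simply concatenating independently deflated blocks does not yield a single valid DEFLATE stream; the real implementation stitches blocks together with flush points, which is omitted here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

static class BlockCompressionSketch
{
    // Illustrates only the fan-out / ordered re-assembly pattern; the output is
    // a sequence of independent deflate blobs, not a valid single DEFLATE stream.
    public static void Compress(Stream input, Stream output, int blockSize = 512 * 1024)
    {
        var pending = new List<Task<byte[]>>();
        var buffer = new byte[blockSize];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            var block = new byte[read];
            Array.Copy(buffer, block, read);
            // each block is compressed independently; block 6 may finish before block 4
            pending.Add(Task.Run(() =>
            {
                using (var ms = new MemoryStream())
                {
                    using (var deflate = new DeflateStream(ms, CompressionLevel.Optimal, leaveOpen: true))
                        deflate.Write(block, 0, block.Length);
                    return ms.ToArray();
                }
            }));
        }

        // the main thread re-assembles the results in the original block order
        foreach (var task in pending)
        {
            var compressed = task.Result;
            output.Write(compressed, 0, compressed.Length);
        }
    }
}
```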
In this way, each entry (file) in a zip archive is compressed completely before the library begins compressing the next entry. There is an obvious opportunity to apply an additional level of parallelism during compression: compressing multiple entries in parallel. DotNetZip doesn't do this now. That approach would make sense when the zip file being created consists of a large number of smaller files, whereas the per-block parallel compression DotNetZip does today makes sense when the archive contains any number of larger files (larger than 512 KB or so).
Using DotNetZip today, on a typical modern laptop, the CPU gets saturated when compressing large files, meaning those with more than 10 or so blocks at the typical block size of 512 KB. So adding the new level of parallelism wouldn't speed up that scenario at all, but it would help the scenario of compressing, say, 70,000 small files into a single archive.
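If that second level of parallelism were added, its shape would be roughly the following hypothetical sketch: deflate each small file on the thread pool, then have a single writer append the pre-compressed payloads in order. DotNetZip has no public API today for writing already-deflated entry data, so `WriteRawEntry` below is purely a placeholder for what such a writer would have to do (local header, payload, central-directory record).

```csharp
using System.IO;
using System.IO.Compression;
using System.Linq;

static class PerEntryParallelismSketch
{
    // Hypothetical: compress many small files in parallel, then write the
    // pre-deflated payloads sequentially. WriteRawEntry is a placeholder; a real
    // implementation would need an archive writer that accepts raw deflated data.
    public static void CompressMany(string inputDir, Stream archiveStream)
    {
        var compressed = Directory.EnumerateFiles(inputDir)
            .AsParallel().AsOrdered()
            .Select(path =>
            {
                using (var ms = new MemoryStream())
                {
                    using (var deflate = new DeflateStream(ms, CompressionLevel.Optimal, leaveOpen: true))
                    using (var file = File.OpenRead(path))
                        file.CopyTo(deflate);
                    return new { path, payload = ms.ToArray() };
                }
            })
            .ToList();

        // single-threaded phase: append the pre-compressed entries in order
        foreach (var entry in compressed)
            WriteRawEntry(archiveStream, entry.path, entry.payload);
    }

    static void WriteRawEntry(Stream archive, string name, byte[] deflatedPayload)
    {
        // placeholder: a real writer would emit the zip local header, the
        // deflated payload, and the central-directory record for this entry
        archive.Write(deflatedPayload, 0, deflatedPayload.Length);
    }
}
```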