Looking for the best algorithm to take a file, split it into N parts, add M redundancy parts and then store the file in N+M different locations. Files would typically be large.
For example: a 1GB file could be split into (32) 32MB parts, have (8) additional 32MB parts computed, and the 1.25GB redundant structure stored out on 40 different areas. The goal would be to recreate the file from any (32) valid parts. An independant hash of the original (32) parts would be available for sanity checking correct reconstruction.
If this is doable, I believe this would provide the functional equivalent of having 8 mirror copies for an overhead of only 25% (plus computation time), would it not?
I found the 1989 Rabin algorithm which appears to do this. Was wondering if anyone was aware of something better/faster?
I recognize this is similar to how Raid 5 and Raid 6 work - looking to take that approach, extend it to 8+ parity blocks, and perform it at the file level.
The way you've defined the problem is indeed called secret sharing, but there have been improvements made since 1989 -- we no longer have to specify M in advance.
Now, you can generate an essentially limitless sequence of blocks for a file, and regenerate the file from any collection of them as long as their lengths add up to the length of the original file.
It's called a "fountain code": https://en.wikipedia.org/wiki/Fountain_code . The "raptor codes" are the best today, I believe: https://en.wikipedia.org/wiki/Raptor_code . They were the first to provide linear time encoding and decoding.
Note that raptor codes are not entirely reliable -- occasionally you may need to collect an extra block -- but for the use cases they support that's a small price to pay for linear time decoding.
If you need to be absolutely sure that you can decode the message given the minimal number of blocks, then I've found that secret sharing using the Chinese Remainder Theorem (described here: https://en.wikipedia.org/wiki/Secret_sharing#Using_the_Chinese_remainder_theorem), but using irreducible polynomials in GF(2^N) instead of prime integers, extends easily into a fountain code, but requries (N^2) time to encode and decode.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With