Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How best to split a file into N parts with M redundant parts such that any N of the N+M parts is sufficient to reconstuct it?

Looking for the best algorithm to take a file, split it into N parts, add M redundancy parts and then store the file in N+M different locations. Files would typically be large.

For example: a 1GB file could be split into (32) 32MB parts, have (8) additional 32MB parts computed, and the 1.25GB redundant structure stored out on 40 different areas. The goal would be to recreate the file from any (32) valid parts. An independant hash of the original (32) parts would be available for sanity checking correct reconstruction.

If this is doable, I believe this would provide the functional equivalent of having 8 mirror copies for an overhead of only 25% (plus computation time), would it not?

I found the 1989 Rabin algorithm which appears to do this. Was wondering if anyone was aware of something better/faster?

I recognize this is similar to how Raid 5 and Raid 6 work - looking to take that approach, extend it to 8+ parity blocks, and perform it at the file level.

like image 386
Cassey Avatar asked Nov 15 '25 21:11

Cassey


1 Answers

The way you've defined the problem is indeed called secret sharing, but there have been improvements made since 1989 -- we no longer have to specify M in advance.

Now, you can generate an essentially limitless sequence of blocks for a file, and regenerate the file from any collection of them as long as their lengths add up to the length of the original file.

It's called a "fountain code": https://en.wikipedia.org/wiki/Fountain_code . The "raptor codes" are the best today, I believe: https://en.wikipedia.org/wiki/Raptor_code . They were the first to provide linear time encoding and decoding.

Note that raptor codes are not entirely reliable -- occasionally you may need to collect an extra block -- but for the use cases they support that's a small price to pay for linear time decoding.

If you need to be absolutely sure that you can decode the message given the minimal number of blocks, then I've found that secret sharing using the Chinese Remainder Theorem (described here: https://en.wikipedia.org/wiki/Secret_sharing#Using_the_Chinese_remainder_theorem), but using irreducible polynomials in GF(2^N) instead of prime integers, extends easily into a fountain code, but requries (N^2) time to encode and decode.

like image 147
Matt Timmermans Avatar answered Nov 17 '25 12:11

Matt Timmermans



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!