I have ordered chunks of data, each hashed individually with sha256. I want to combine those hashes into one sha256 hash. Should I just feed the hashes into sha256 as data, or is there another way that's better from a math/crypto standpoint? It might seem like a trivial question, but intuitions are often wrong when it comes to crypto.
edit: The purpose of this is to form a sort of blockchain, although that term is pretty overloaded these days. It's for integrity purposes, not proof of work. The idea is to hash the blocks at the follower nodes, combine the hashes into one on the cluster leader to get a hash representing the chain as a whole, and then prepend that to the new blocks to be hashed.
It's a little odd in that it's a distributed system, so the "whole chain hash" is usually a little stale: I know the hash representing the chain, as known to that node, at the time a block was created there. But several blocks may "hook onto the chain" at that particular hash; those are then ordered and combined into the system hash, which eventually gets prepended to new blocks.
I'm using Go, if that matters.
If you are trying to recreate the hash of a large payload (e.g. a 1 GB file) that has been split into chunks (e.g. 10 MB each), the hash (MD5, SHA-256, etc.) needs to be computed over the entire payload. So in this example, you cannot combine the 100 chunk hashes to recreate the hash of the original file. However...
You could send 2 values with each chunk: the chunk's byte offset, and the intermediate hash state at that offset.
As the chunks are streamed in, one can verify the seams: the hash state at the end of chunk N must match the hash state at the beginning of chunk N+1.
The hash state after the final chunk will be the hash for the entire payload.
Why do it like this? Because the hash can be computed in real time as the file chunks are received, rather than as a separate time-consuming pass after all the file chunks have arrived.
Edit: based on comments:
Here's a crude hash-state solution:
Create a large random file (100MB):
dd if=/dev/urandom of=large.bin bs=1048576 count=100
Using an external tool to verify hash:
$ shasum -a 256 large.bin
4cc76e41bbd82a05f97fc03c7eb3d1f5d98f4e7e24248d7944f8caaf8dc55c5c large.bin
Running this playground code on the above file:
...
offset: 102760448 hash: 8ae7928735716a60ae0c4e923b8f0db8f33a5b89f6b697093ea97f003c85bb56 state: 736861032a24f8927fc4aa17527e1919aba8ea40c0407d5452c752a82a99c06149fd8d35000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006200000
offset: 103809024 hash: fbbfd2794cd944b276a04a89b49a5e2c8006ced9ff710cc044bed949fee5899f state: 73686103bdde167db6a5b09ebc69a5abce51176e635add81e190aa64edceb280f82d6c08000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006300000
offset: 104857600 hash: 4cc76e41bbd82a05f97fc03c7eb3d1f5d98f4e7e24248d7944f8caaf8dc55c5c state: 73686103c29dbc4aaaa7aa1ce65b9dfccbf0e3a18a89c95fd50c1e02ac1c73271cfdc3e0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006400000
The final hash matches.
Trying with an offset and intermediate hash state: the file will be seeked to this offset, resuming the hash calculation from that point:
$ ./hash -o 102760448 -s "736861032a24f8927fc4aa17527e1919aba8ea40c0407d5452c752a82a99c06149fd8d35000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006200000"
offset: 103809024 hash: fbbfd2794cd944b276a04a89b49a5e2c8006ced9ff710cc044bed949fee5899f state: 73686103bdde167db6a5b09ebc69a5abce51176e635add81e190aa64edceb280f82d6c08000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006300000
offset: 104857600 hash: 4cc76e41bbd82a05f97fc03c7eb3d1f5d98f4e7e24248d7944f8caaf8dc55c5c state: 73686103c29dbc4aaaa7aa1ce65b9dfccbf0e3a18a89c95fd50c1e02ac1c73271cfdc3e0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006400000
We get the same final hash as before.
Note: this does expose the hash's internal state, so be mindful of the security implications this may entail. With a large chunk size, this should not be an issue.