My point here is it IS possible to transfer just a hash and counter value and effectively generate identical data at the remote end. The limit that will be hit is the difficulty of generating and comparing hash values with current processing power. I'm proposing iterating through generated data up until the actual data. It's not even a storage issue, as once you have incremented the data you don't need to store old data or hash values - just the counter. No massive hash tables. It's a CPU issue. On 19 May 2011 00:42, <Valdis.Kletnieks@vt.edu> wrote:
On Thu, 19 May 2011 00:26:26 BST, Heath Jones said:
I wonder if this is possible:
- Take a hash of the original file. Keep a counter. - Generate data in some sequential method on sender side (for example simply starting at 0 and iterating until you generate the same as the original data) - Each time you iterate, take the hash of the generated data. If it matches the hash of the original file, increment counter. - Send the hash and the counter value to recipient. - Recipient performs same sequential generation method, stopping when counter reached.
MD5 is a 128 bit hash.
2^128 is 340,282,366,920,938,463,463,374,607,431,768,211,456 - you're welcome to iterate that many times to find a duplicate. You may get lucky and get a hit in the first trillion or so attempts - but you may get unlucky and not get a hit until the *last* few trillion attempts. On average you'll have to iterate about half that huge number before you get a hit.
And it's lossy - if you hash all the possible 4K blocks with MD5, you'll find that each of those 2^128 hashes has been hit about 256 times - and no indication in the hash of *which* of the 256 colliding 4K blocks you have on this iteration. (The only reason that companies can do block-level de-duplication by saving a hash as an index to one copy shared by all blocks with the same hash value is because you have a *very small* fraction of the possibilities covered, so if you saved a 4K block of data from somebody's system32 folder under a given MD5 hash, it's *far* more likely that another block with that same hash is from another copy of another identical system32 folder, than it is an actual accidental collision.)
Protip: A good hash function is by definition one-way - given the data, it's easy to generate the hash - but reversing it to find the "pre-image" (the data that *generated* the hash) is massively difficult.