On Thu, 19 May 2011 00:26:26 BST, Heath Jones said:
I wonder if this is possible:
- Take a hash of the original file. Keep a counter.
- Generate data by some sequential method on the sender side (for example, simply starting at 0 and iterating until you generate the same data as the original).
- Each time you iterate, take the hash of the generated data. If it matches the hash of the original file, increment the counter.
- Send the hash and the counter value to the recipient.
- The recipient performs the same sequential generation method, stopping when the counter is reached.
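The steps above can be sketched in a few lines of Python. This is a toy sketch under an assumed enumeration order (all byte strings, shortest first); it only terminates in reasonable time for tiny inputs, which is exactly the point of the reply that follows:

```python
import hashlib
from itertools import count

def number_to_bytes(n: int) -> bytes:
    """Map 0, 1, 2, ... onto b'', b'\x00', b'\x01', ..., b'\xff', b'\x00\x00', ...
    (every byte string, enumerated shortest-first). This is the assumed
    'sequential generation method' - any agreed-upon order would do."""
    length = 0
    while n >= 256 ** length:
        n -= 256 ** length
        length += 1
    return n.to_bytes(length, 'big') if length else b''

def sender(original: bytes) -> tuple[str, int]:
    """Hash the original, then enumerate candidates, counting how many
    candidates hash to the same digest up to and including the original."""
    target = hashlib.md5(original).hexdigest()
    counter = 0
    for n in count():
        candidate = number_to_bytes(n)
        if hashlib.md5(candidate).hexdigest() == target:
            counter += 1
            if candidate == original:
                return target, counter

def recipient(digest: str, counter: int) -> bytes:
    """Replay the same enumeration, stopping at the counter-th digest match."""
    seen = 0
    for n in count():
        candidate = number_to_bytes(n)
        if hashlib.md5(candidate).hexdigest() == digest:
            seen += 1
            if seen == counter:
                return candidate

# Round-trip works for a tiny input; for a real file the loop would have
# to run an astronomically large number of iterations.
digest, counter = sender(b'hi')
assert recipient(digest, counter) == b'hi'
```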
MD5 is a 128-bit hash. 2^128 is 340,282,366,920,938,463,463,374,607,431,768,211,456 - you're welcome to iterate that many times to find a duplicate. You may get lucky and get a hit in the first trillion or so attempts - but you may get unlucky and not get a hit until the *last* few trillion attempts. On average you'll have to iterate about half that huge number before you get a hit.

And it's lossy - if you hash all the possible 4K blocks with MD5, you'll find that each of those 2^128 hash values has been hit about 2^32640 times (there are 2^32768 possible 4K blocks but only 2^128 digests) - and no indication in the hash of *which* of those colliding 4K blocks you have on this iteration.

(The only reason that companies can do block-level de-duplication by saving a hash as an index to one copy shared by all blocks with the same hash value is that you have a *very small* fraction of the possibilities covered - so if you saved a 4K block of data from somebody's system32 folder under a given MD5 hash, it's *far* more likely that another block with that same hash is from another copy of an identical system32 folder than that it's an actual accidental collision.)

Protip: a good hash function is by definition one-way - given the data, it's easy to generate the hash, but reversing it to find the "pre-image" (the data that *generated* the hash) is massively difficult.
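The de-duplication point in the parenthetical can be illustrated with a toy content-addressed block store - a sketch assuming an in-memory dict keyed by MD5 digest (real systems use stronger hashes and on-disk indexes, but the idea is the same):

```python
import hashlib

class BlockStore:
    """Toy content-addressed store: identical blocks are kept once, keyed by
    their MD5 digest. Accidental collisions are ignored because the blocks
    actually stored cover a vanishingly small fraction of the 2**128 space,
    so a repeated digest almost certainly means a repeated block."""

    def __init__(self):
        self.blocks = {}  # hex digest -> block bytes

    def put(self, block: bytes) -> str:
        digest = hashlib.md5(block).hexdigest()
        # If the digest is already present, treat the block as a duplicate
        # and store nothing new.
        self.blocks.setdefault(digest, block)
        return digest

    def get(self, digest: str) -> bytes:
        return self.blocks[digest]

# Two identical 4K blocks (say, from two copies of the same system32 file)
# are stored only once.
store = BlockStore()
d1 = store.put(b'A' * 4096)
d2 = store.put(b'A' * 4096)
assert d1 == d2 and len(store.blocks) == 1
```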