On Wed, May 18, 2011 at 10:42 PM, Heath Jones <hj1980@gmail.com> wrote:
Example: I want to send you the number 1000000000000000000000000000000000000000000000000000000000000000000000000000000000000. The MD5 hash of this is f59a3651eafa7c4dbbb547dd7d6b41d7. I generate data 0,1,2,3,4,5.. all the way up to 1000000000000000000000000000000000000000000000000000000000000000000000000000000000000, observing the hash value of the data just generated each time. Whenever the hash matches f59a3651eafa7c4dbbb547dd7d6b41d7 , I increment a counter. Once I have reached the number I want to send you, I send the hash value and the counter value.
You perform the same function starting at 0 and working your way up until you have a matching counter value. The number of collisions in the range 0 -> target is represented by the counter value, and as long as both sides are performing the same sequence this will work.
Obviously this is completely crazy and would never happen with current processing power... It's just theoretical nonsense, but answers the OP's question.
The point here, however, is that for most cases, the hash f59a3651eafa7c4dbbb547dd7d6b41d7 and the length of the counter will amount to sending more data than just transmitting the number. For example, MD5 is a 16 byte hash. For a 20 byte value, you'll need to transmit a 16 byte hash, plus a counter. On average, of all 20 byte values, where there are 2^128 hash outputs, each hash output will have 2^32 possible input values, and a counter to store that will need to be 32 bits long. So you're back where you started. Sometimes this might even take more space, say if there are 0 strings which hash to 0000...0 and 2^33 which hash to 0000...1 and 2^32 that hash to each other value, if you're trying to transmit the last item that hashes to 0000...1, you'll actually need more space. Your compression algorithm just became an inflation algorithm. --Dan