Tuesday, January 10, 2017

Let's talk about hash collisions.

As I work I listen to podcasts. Today this episode of Skeptoid on password cracking was in my podcatcher. Brian Dunning is usually pretty good with his assertions, and I enjoy the show in general. However, in this episode he claims the best hashing algorithms produce unique hashes. Let's look into why that's not necessarily true.

What is a hash, you ask? In the real world we use hashes to save passwords, or to verify the file we downloaded was the one we expected, it's a core part of what makes virtual currencies like Bitcoin work; they pop up a lot a layer or two under the computing most people do.

How do hashes work? I'll describe it in broad, simple terms. Alice wants to be able to prove that Bob knows something in the future, but she doesn't want to know it herself. Alice asks Bob to write the thing he knows into her magic box, and the magic box prints out something that looks like random gibberish. But it's not. Every time Bob writes the thing he knows the box produces the same random-looking gibberish. Alice cannot decipher what Bob knows, but she knows that her box will turn it into the same thing every time.

The magic box is performing the function of a hashing algorithm. It's job is to keep Alice from knowing what Bob knows, while allowing her to verify that Bob knows the thing.

There are two problems with this box, in a practical sense. First, if not bounded the output of the hash could grow prohibitively large. So, modern hash functions map data to a fixed size (say, 64 characters) to allow them to be stored and compared easily. This amplifies our second problem: collisions happen.

What is a collision?

Say Bob enters "taco" into Alice's magic box. The box prints out "fdsa342." Then Claire enters "The breath of the morning I keep forgetting the smell of the warm summer air." And the box prints out "fdsa342" again! Confused, Alice enters "Is this thing broken?" into the box, and it spits out "tkb432s."

The box is working as intended, so what happened?

Math happened. To distill the problem down to the point of absurdity, 2+2 equals 4, but 1+3 also equals 4. Now, hashing functions - particularly modern hashing functions - are way more complicated than that, but the problem is the same. If you permitted your hashing function to produce a value of any size, you could probably come up with a way to avoid hitting the same number twice. But it wouldn't be particularly usable in the real world we have to limit the size of the output of our hash function. This leads to what mathematicians call "The Pigeonhole Principal." In easy to understand terms, if you have 9 pigeons and 8 pigeonholes, at least two pigeons will share a pigeonhole.

So, since we have (the arbitrary) 64 characters worth of space to work with we have approximately 126,886,932,185,884,164,103,433,389,335,161,480,802,865,516,174,545,192,198,801,894,375,214,704,230,400,000,000,000,000. That's a lot of pigeonholes! But it's not infinite. Once we produce (!64)+1 combinations of data, we will have a guaranteed collision. As long as we allow humans and their tools to produce infinite combinations of data, and require hashes to produce values of a fixed size, we will have collisions.

There are ways to limit the probability of collisions, and we take advantage of (some of) those, but we cannot eliminate the problem completely.

The good news is that it is difficult to find useful collisions. The MD5 hashing algorithm is about as broken as anything in use today. You can read a technical explanation of the process of producing collisions for that algorithm here.

I hope you learned a bit more about what collisions are, why they're unavoidable, and why they aren't always the end of the world.