So with the release of Quickhash v2.8.0 (arguably the most significant release of the program since v2.0) came the super fast xxHash algorithm. Many of you may be wondering what it is and why I added it. But I also know many of you who use QuickHash are probably “hash enthusiasts” who always like to check that File A in front of you is the same as FileA over there with utter certainty and there is no room for any doubt. You’re also probably hash purists. “MD5 is broken…SHA-1 is not much better, you should only ever use SHA-256…” blah blah blah.
Maybe you’re right. Maybe you’re not. Maybe you read some academic paper about hash collisions of 5 bytes of data (which of course makes the issue significant for password security) without fully grasping the concept of engineered collisions and the still high likelihood that an MD5 of any volume data over a few Mb is extremely unlikely. If the above sounds like you, then xxHash is probably not for you.
But, I am a developer. And developers have to take everyones needs into consideration in order to make something useful to the general populous. Quickhash is used all over the world by all kinds of people in every industry I can think of that uses data checks. And that is why I have incorporated xxHash – for speed!
xxHash is (to quote the developer Yann Collet) : “an Extremely fast Hash algorithm, running at RAM speed limits. It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical on all platforms (little / big endian).”
SMHasher is: “SMHasher is a test suite designed to test the distribution, collision, and performance properties of non-cryptographic hash functions“.
xxHash scores 10 for quality in that test, the same as MD5 and SHA-1. But xxHash is not a cryptographic hash function like the others. I read somewhere that the “uniqueness” of an xxHash is about the same as MD5. So not as unique as SHA-1, but about the same as MD5. So if you don’t like MD5, you probably won’t like xxHash either. But the speed difference maybe enough to make you want to consider it. Read on…
It is important to stress the importance here of 32 vs 64 bit compilation. For the other 4 hash algorithms incorporated into QuickHash, CPU architecture makes little difference to performance because memory usage is the same – QuickHash only ever uses a few Mb of RAM. But xxHash is different when it comes to CPU utilisation and for that reason, the user is shown the most suitable xxHash for their given platform inside QuickHash.
Ideal World Benchmarks
32-bit xxHash Implementation
The following numbers are “ideal world, ideal hardware, perfectly coded application etc etc”. On a high-speed modern hardware setup where everything is aligned perfectly for maximum speed (high speed transfer buses, SSDs etc), 32-bit applications (for which QuickHash is available) will compute MD5 theoretically at about 0.33Gb\s; around 20Gb a minute. This is true for most hashing tools – not specifically QuickHash, simply because of how MD5 works. For SHA-1, it is about 0.28Gb\s; around 16Gb a minute. When using xxHash, the speed will conceivably be around the 5.4Gb\s mark on the same system! So around 324Gb a minute! That’s about 15 times faster. But, as I state, these are theoretical values. In real life, with running operating systems, other resources pulling at the CPU etc, life is different.
Anyway, given these theoretical nirvana’s, if hashing a 500Gb disk with MD5 takes 25 minutes ((500Gb divide 20Gb\s) on a 32-bit system, you can realistically expect a xxHash result on that same system in about 1.5 minutes (500Gb divide 324Gb\s)! That is, according to the public benchmarks, and that is probably true in a professionally coded system.
64-bit xxHash Implementation
On 64-bit systems, MD5 will compute at about the same speed (multi-threaded programs aside) of 0.33Gb\s but the 64-bit implementation of xxHash will compute at about 13Gb\s (780Gb p\minute)! That’s about 30 times faster than MD5 and 2.5 times faster than the already very fast 32-bit xxHash algorithm!
So, if hashing a 500Gb disk with MD5 takes 25 minutes ((500Gb divide 20Gb\s) on a 64-bit system (about the same as 32-bit system), you can realistically expect a result of a 64-bit implementation of xxHash on a 64-bit system in about 38 seconds (500Gb divide by 13Gb\s)! It doesn’t sound feasible but they are the numbers. Can a 500Gb disk even read that fast, even an SSD?
Anyway, so, 25 minutes for MD5 vs 38 seconds for an xxHash.
Sounds to be good to be true!? So what’s the downside? Why isn’t everyone using it? Well it’s not been fully documented yet and I’m not sure where these benchmarks come from specifically or what hardware rig is used. The source code ‘is’ the documentation, and until that is addressed use of xxHash in science critical areas may take a while to adopt. In legal circles, even more so. And people trust certain hash algorithms. Trust has to be earned and to my knowledge xxHash is not in wide global use at the moment so that global trust is still building I suspect. It’s also fairly new when compared to the years that MD5, SHA-1, SH256 have been on the scene. But I’m a big believer in trying something that breaks the mould and giving it a chance to flourish. That’s why I have added xxHash to Quickhash but not made it the default algorithm, yet. Maybe one day, xxHash will be the default!
If you’re a home user who wants to check that the xxHash of your DD image matches the restore of your disk that you’ve just restored your DD image to, then why not just use xxHash to check? xxHash your DD image file using the “File” tab of QuickHash (has to be a single DD image file, not a segmented one) and then xxHash your disk using the “Disks” tab of QuickHash. It will take a lot less time than any of the other algorithms and is probably not as critical as say court evidence or something like that.
Or if you’ve just downloaded a 4Gb ISO of a Linux distribution, why wait 2 or 3 minutes for a SHA-1 hash when you could have an xxHash in about 1 minute? Well, again, for now, you won’t have a choice there I’m afraid until attitudes change because website administrators always use MD5 or SHA-1 or SHA256 for ISOs. So you have to compare what you have with what you have been given. You can’t compare an apple against an orange. And you can’t compare an xxHash against a SHA-1 hash. But again, with tools like QuickHash and others like it implementing xxHash, maybe the tides will turn.
An important point is that although it is possible to compile the 64-bit implementation of xxHash for use on 32-bit systems, it will run at about 1.9Gb\s on that 32-bit system instead of potentially around 13Gb\s on a true 64-bit system. Whereas the 32-bit implementation of xxHash will run on a 64-bit system at about 6.8Gb\s and on a 32-bit system at about 6Gb\s. In other words, although both implementations will ‘work’ on both CPU architectures, there is a huge difference of the 64-bit implementation if it is run on a 32-bit system vs a 64-bit one. For that specific reason, QuickHash will make available to you the version most suitable to your CPU. If you run it on a 32-bit system, you will be shown “xxHash32” in the options. If you run it on a 64-bit system, you will be shown “xxHash64” in the options.
Real life benchmark
Also, by way of a quick demo, I used the 32-bit version of QuickHash and the SHA-1 algorithm to hash a 750Gb Western Digital SATA disk, traditional spinning platter, acting as a slave drive and holding a Windows 10 pagefile. It was not isolated from the system, no write blockers or any of those other things that make life faster. The operating system was Windows 10 64-bit running on a ten year old HP workstation (yes, ten years old!) with two AMD Opteron 2.44Ghz CPU’s and 64Gb RAM.
It was hashed in 3:33 with SHA-1.
The same disk hashed using QuickHash 64-bit and the xxHash64 algorithm (and after a reboot to flush and drive cache) was hashed in 2:21. So that’s about 40% faster using xxHash than using SHA-1 on this particular computer and examining this particular disk. So not quite the speeds of an ideal world but still a good result.
With files, on a running 64 bit system, I didn’t notice a huge difference. A 1.5Gb movie file took 15 seconds with SHA-1 and 12 seconds with xxHash. This is nowhere near what I was expecting given the published benchmarks. On the other hand, on a 32-bit system, a 500Mb file took 5 seconds with SHA-1 and 2 seconds with xxHash! So there are clearly some interesting results to be had in more controlled environments. Or maybe I have just implemented xxHash really bad. Or maybe it was just my system or something like that – I did have VMware running in the background after all! I’d be interested to hear your views with it by commenting below and any of your benchmarks.
Small error in the comparisions, twice you say “MD5 takes 25 minutes ((500Gb divide 20Gb\s” but you also say MD5 hashes at the speed of 0.333Gb\s which is 20Gb\minute not second. Cheers for the write up otherwise, interesting stuff
I tried xxHash 32-bit on ~400k 4kB pages both individually in random order and as a contiguous memory region. That should be about ~1.5-1.6 GB worth of space. The algorithm was pretty much instant in both cases, so as hashman said the algorithm is not the bottleneck, the I/O is. Note that this was with the 32-bit version.
At such speeds, limitations come from the I/O (HDD or SSD or network).
xxHash is going to be faster than any of them, and is just waiting for more data to read and hash.