This thread is a continuation of a thread at the TDS forum in which was written about the "security" of HASH-algorithms. I'm talking about this thread: http://www.wilderssecurity.com/showthread.php?t=7202 I will try to copy the more general parts from that thread to this one.
Quote from Wayne: [hr] CRC32 is an excellent 32-bit checksum - its small, very fast, and the chance of two files having the same CRC32 checksum are 1 in 2^32 (4,294,967,296). All file integrity checkers are trying to determine if 1 or more bits in the file has changed - you don't need 128bit algorithms to do this, it's actually overkill. Even a 16-bit checksum would suffice for this kind of simple task, but 32-bit is ideal. For a trojan to modify a CRC32-checked file (ie. c:\autoexec.bat so it could add an autostart entry for itself) it would have to continuously modify the file (in memory), CRCing it each time, up to 4,294,967,296 times, until it found a match of the original. The file might be huge (up to 4 gigabytes) and it may corrupt the existing file due to random data. 4 billion CRC calculations will take a fair while as far as trojans go, generating the random data also slows it down, and as the other file with the same CRC could be anything up to 4GB in size, the trojan would have no choice but to use disk space, and disk I/O is always slow. Is all of that slow brute-forcing really worth hiding a modification in one file?
Quote from Joseph: [hr] Well, you got the first part right -- CRC32 is indeed an excellent 32-bit checksum; if it wasn't, it would have been superseded by some other 32-bit hashing function. Unfortunately, you got the second part wrong. What you displayed is actually the probability that if you randomly generate a 32-bit checksum and then select a file (randomly), the probability that it will match the arbitrary checksum. Well, that's not the issue. We ran into the same false logic when developing the FBI's DNA database -- this is exactly what they wanted to do. Unfortunately, it doesn't quite work that way. In that instance, the REAL question was what's the probability that a DNA profile based on forensic evidence from a crime scene may be matched by two or more individual's DNA? Now, given that approximately 1% of Caucasians are in fact identical twins, this obviously isn't even close to being correct. (There are a lot of other complications in DNA due to genetic inheritance, but we don't need to go there.) There are two fundamental problems with your second statement. First, CRC32 is actually only a pseudorandom generator (albeit a very nice one), but the odds aren't quite up to 1 in 232. And, in actuality, given that most files tend to be of finite size, they don't even approximate this level. At best, the question is: What is the probability that two (or more) files on a machine will generate the same CRC32 value (purely randomly)? This probability is much higher. (Indeed, it follows the same logic related to the 'Birthday problem': What's the probability that two individuals in a class will have the same date of birth (excluding year considerations)? If we assume that there are only 365 possible birthdates in a year, the odds become even when the class size exceeds 23 individuals. It surpasses 90% if there are approximately 40 people in the class. Now, I can't speak for others, but I have over 10,000 executable files on this clunky old Win 98 SE machine. The possibility that two (or more) of them would generate the same CRC32 hash is far higher than what you indicate. Indeed, this is part of the reason that even Zone Alarm went to MD5 (128-bit hash algorithm) and that Microsoft went even further to SHA1 (160-bit hash algorithm). And, very shortly, I expect that we will be seeing routine use of 256-bit and 320-bit hash algorithms to establish authenticity. Quite frankly, a lot of searches for file duplicates (regardless of how they may be named or what file extension they may use) use these higher-order hash algorithms for precisely this reason. Still, not finished. All we've been talking about (so far) is whether two, perfectly innocuous, files are likely to generate the same CRC32. Now, let's talk about someone trying to maliciously 'match' a file by giving it the same filename/fileext and CRC32 hash. Last time I checked (must have been about two years ago), that could be done in about two hours on a desktop PC. Well, this is what malicious code is all about, now, isn't it? They want to pass the 'duck test'. And, at this point, all the nice fine statistical analyses about randomly generated hashes goes right out the window. Bad scenario, Wayne. Nobody in their right mind is going to try to 'alias' an individual user's Autoexec.bat file. They're going to go after something like explorer.exe, iexplore.exe, kernel32.dll, mprexe.exe, msimn.exe, msmgs.exe, mstask.exe, outlook.exe, pstores.exe, rnapp.exe, rpcss.exe, rundll32.exe (just to name some options on Win 98 SE, which I'm using as I write this). I think you can easily guess which ones I'd target -- the ones that the user doesn't typically see and thereby notice that it doesn't quite function normally anymore. They can do this before they distribute their little nasty, no need to run the junking algorithm on the target machine. (Exactly how they get it on the end-user's box is an entirely different matter, but I think we both know how this could be done.) The problem that they confront in successfully accomplishing this is directly related to whether they have to anticipate CRC32 (32-bit), MD5 (128-bit), or even SHA1(160-bit) Let's ignore the matter of more sophisticated hashes for the moment. As I noted in my original posting, methods of generating an app with a CRC32 identical to that of an existing valid application have been known for at least ten years; even MD5 has been cracked since (what?) 1996 or 1997.
Quote from Joseph: [hr] Oops, forgot to include the relevant algorithm: I believe that it runs as follows: Given that a particular hash algorithm can truly and randomly generate n distinct hashes (232 for CRC32, 2128 for MD5, and 2160 for SHA1), and that there are k executable files on a particular machine, then the probability that two or more executables will exhibit the same hash are: 1- n!/((n-k)!*nk) For the reasons noted above, this is a very conservative estimate, but it is far greater than 1/n in most practical instances. (Don't try running that formula on a standard PC; you'll get overflow and horrendous rounding errors; this is just the mathematical formula.)
Quote from Jason: [hr] You provide a lot of good points Joseph. I just want to provide a bit more information in regards to MD5 being "cracked". Given a perfect HASH for a given bitsize, lets say a perfect 128bit hash, the number of collisons in an amount of data is 2 ^ ( sizeof(DATA)-sizeof(HASH) )bits A collision occurs when some DATA has a HASH the same as another piece of DIFFERENT DATA. So if we have 17bytes (136bits) of DATA and we are hashing it with an 16byte hash (128bit) the number of collisions is 256 ( 136-128 = 8, hence 2^. So if we used a perfect 128bit hash on a 17 byte password there are 255 OTHER (<=17 byte) passwords with the same hash. If the password was 18bytes long then there are 65536 (2^16) other passwords matching the same hash. This is the basis of BRUTE-FORCE cracking any HASH to find DATA which will match a specified hash. The time taken to BRUTE-FORCE a HASH is significantly high because there are 2^(sizeof(HASH)bits) different combinations to try, of course there are varying optimizations to this limited attack on a hash. A lot of HASH's have been cracked in a way which you don't need to BRUTE FORCE to find a collision or minimize the amount of combinations to such an extent it can be cracked very quickly. Given that any HASH or Encryption (excluding one time pads) can be brute forced given enough time I wouldn't call a brute force attempt a crack as such. To my knowledge MD5 hasn't been FULLY broken in a way that you can find any collision given a length unless you brute force it. It will always be the way that when 128bit hashes (MD5, etc) can be easily bruteforced, everyone else should be using the next best thing. Similar to what is happening with RSA encryption at the moment, computers are getting quicker and quicker so we keep moving up the amount of BITS we encrypt in. Though BRUTE FORCING prime numbers can be done much more quickly these days then brute forcing HASH's since you don't have to test EVERY single combination unlike in good HASH's. You can work out the odds of two different files having the same HASH given the number of collisions
FanJ, I need to reference the preceding paper at http://www.wilderssecurity.com/showthread.php?t=545 , but I note that some of the hyperlinks in that article seem to be either no longer active or incorrect. I think that some of these links were yours; can you validate them (and the relevance to the context in which they occur)? It also seems that some of the references that I provided in my original text are now missing. (Maybe you found those non-functional! ) I'll go back and try to dig out my original write-ups (including the ones that I provided to Albert, which you may never have seen). With regards to Jason's comments about MD5, I'd like to make some things clear, to avoid any misinterpretation of my position. I had hoped that Albert's rendition of NIS File Check would, in fact, also include the MD5 hash as an option. Apparently, Albert was unable to locate an authoritative (and compact) rendition of the various hashing algorithms that also included MD5 and so he went with the hashes he could find compact hash algorithms for. For example, the hashing algorithms that I used in both my prototypical Excel and Access implementations did include MD5 algorithms. Unfortunately, the DLL that I then used to accomplish that is almost as big as my application (some 800 kB, as I recall)! Subsequently, I found that Microsoft was including a library of hashing functions that could be used for this purpose, but I never rewrote my prototypes to use that library. Consequently, just to make sure no one misunderstands my position, I have no objection whatsoever to providing MD5 hashing as an option. Let's face it: Almost every software firewall (other than NIS/NPF) uses MD5 and it would be nice if NIS File Check also could provide this as a supplementary option. Indeed, I believe both TripWire and Integrity Master offer MD5 hashing as an option. And MD5 is almost universally used in the *NIX world. My comments about MD5, in other words, were simply intended as a cautionary note to those who might prefer this algorithm. My approach to acknowledging the good, the bad, and the ugly; and just like Hans Blix is about to learn, I suspect, not everyone is going to be pleased with the result.