Comment for mcrypt_create_iv

=== Original comment for PHP's mcrypt_create_iv ===

Block ciphers, at their core, are a pair of transformation algorithms, called transforms. One encrypts, one decrypts - in some cases the algorithms are one and the same, but that's not important. A block transform takes a fixed-length block of plaintext, transforms it using a secret key of some chosen size, and produces an identical-length block of ciphertext. Or, of course, vice versa (decryption).

The security model of a block cipher is, simplistically at least, defined to be "if you encrypt one block of plaintext with a randomly chosen key, it will be computationally infeasible for an attacker with knowledge of only the ciphertext (i.e. he does not know the key) to discover information about the contents of the plaintext". The reality is slightly different - there are also clauses around knowing part of the plaintext and not being able to discover any more of it - but that's another story.

Once you start encrypting more than one block of plaintext using the same block transform and the same key, all bets are off. In the Electronic Codebook (ECB) cipher mode, each block of plaintext is independently transformed using the same key. This leads to a problem: identical plaintext blocks produce identical ciphertext blocks, when the same key is used. This means that "patterns" in the data can be seen, especially in data formats that have repeating patterns or long sequences of identical data. This is best described visually, with an ECB-encrypted bitmap. See the Wikipedia article on ECB for a demonstration of this.

In order to fix this problem, cryptographers came up with a set of algorithms that, by mathematical proof and with certain caveats, allow for multiple block transforms to be securely performed with the same key, without running into the problems of ECB.

One such mode is Cipher Block Chaining (CBC). In CBC mode, each plaintext block is first xor'ed with the previous ciphertext block, before being transformed by the block cipher. This causes a chaining effect, whereby the contents of all previous blocks chain together to ensure that repeating sequences no longer produce equal blocks of plaintext. You may have noticed two problems here, though - first, what do you xor the first block with? It has no previous ciphertext block to be combined with, so instead we use an initialisation vector (IV). Second, if you independently encrypt two identical messages with this scheme, using the same key, you'll still get two identical ciphertexts. This is where the requirements of the IV come in.

An IV is always, for the purposes of every block cipher I know of, the same size as a block. Its one true requirement is that it is unique, but in most cases it is also required to be unpredictable (i.e. randomly chosen). It does not need to be secret - in fact the whole point is that it needs to be known at time of decryption. The IV essentially acts as a salt value, ensuring that even if two identical messages are encrypted with the same key, they would still produce entirely different ciphertexts as long as a unique IV was used for each. The IV must be unique per key, i.e. for any set of encryption operations using the same key, there must never be two operations performed with the same IV.

One problem with CBC is that it suffers from a property called malleability. A cipher is considered malleable when an attacker can modify the ciphertext message in some way in order to produce a useful and calculable change in the plaintext upon decryption, without the attacker ever knowing the key. In this case, if you transmit your message as C = IV | E(M, k, IV), the attacker might change IV. Upon decryption this affects only the first block, as all other blocks' so-called previous blocks are part of the ciphertext. If you xor the IV with some known value, the first block's plaintext also gets xor'ed with that known value. So, if you know that the first block says "Attack the Elvish house at dawn.", you might xor the last 5 bytes of the IV with 0x0A0E180000, so that it now decrypts to "Attack the Elvish house at noon.". This problem also exists in other blocks: it is possible to mess with the plaintext of any block as long as you don't care that it'll completely trash the previous plaintext block.

This DOES NOT mean that you should encrypt or obfuscate your IV - doing so would be an exercise in futility. Instead, you should apply an authenticity check. Block ciphers are all about confidentiality - they don't ensure integrity. You need to perform an integrity check yourself, and verify that the information was not tampered with. This is where an algorithm such as HMAC comes in.

HMAC is a way of turning a hash algorithm such as SHA256 into an authenticity check. Whereby a normal hash is constructed as h = H(m), a HMAC computation is constructed as  h = HMAC(m, k), where k is a secret key. If two parties are the only people who know the secret key, an attacker cannot forge the HMAC hash. For example, Alice computes h = HMAC("Hello", k) on a message, then appends h to the message and transmits it. Bob then receives the message, computes HMAC(m, k) on the received message m, and compares the result to the hash appended to the message. If they match, he knows that the message was not tampered with.

This trick can be used to ensure that malleability attacks don't work. You encrypt a message using AES-CBC or some other strong construction, append the IV to that message, then compute the HMAC hash of that entire blob and transmit it alongside. On the other end, the HMAC hash is verified, then decryption occurs. This way, the neither the message nor the IV can be tampered with in transit.

One note to make is that it is good practice to use independent keys for encryption and authenticity. This, at minimum, ensures that any break in one algorithm doesn't affect the other.