In the last blog about data integrity, we learned that a Cyclic Redundancy Check (CRC) doesn't provide any protection against malicious adversaries.
To protect against adversaries that are purposefully trying to modify data, we will need a stronger solution than error-detecting/correcting codes like CRCs. Ideally, no (computationally bounded) adversary should be able to convince us that illegitimate data came from our devices, or to modify our data without us noticing. Before introducing two possible solutions, we will review a few of the building blocks that are needed. First, we need to understand the high-level differences between symmetric and asymmetric cryptography. For a more in-depth introduction, see our blog series on encryption.
Different cryptographic solutions can be grouped into classes based on how the private key is used. When a single private key is shared by all users, they all have the same knowledge of the key material. Since the information each user has on the key material is symmetric, this class is referred to as symmetric cryptography. In the other class, each user has a set of two mathematically related keys: a private key known only to them, and a public key which they share with everyone. Since the private key is known only to the user, the information between users is now asymmetric.
The important distinctions to note are that in symmetric cryptography, every user has a copy of the same private key, and that same key is used on both sides of each operation: encrypting and decrypting, as well as generating and verifying a signature. In asymmetric cryptography, only the user knows their private key, and only that private key may be used to decrypt or sign data.
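To make the distinction concrete, here is a minimal sketch in Python, using the standard library's hmac module for the symmetric case and the pyca/cryptography package for the asymmetric case; the keys and message are made up for illustration. With the symmetric key, the same key material both generates and verifies the tag; with the asymmetric key pair, only the private key can sign, while anyone holding the public key can verify.

```python
import hmac, hashlib, os
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

message = b"example message"

# Symmetric: both parties hold a copy of the same secret key,
# and that same key both generates and verifies the authentication tag.
shared_key = os.urandom(32)
tag = hmac.new(shared_key, message, hashlib.sha256).digest()       # sender
assert hmac.compare_digest(
    tag, hmac.new(shared_key, message, hashlib.sha256).digest())   # receiver, same key

# Asymmetric: only the signer holds the private key; anyone with the
# corresponding public key can verify, but cannot sign.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()
signature = private_key.sign(message)    # requires the private key
public_key.verify(signature, message)    # raises InvalidSignature on a mismatch
```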
The other tool that’s needed is a method to create a short digest or “fingerprint” of arbitrarily large amounts of data. Since cryptographic operations are computationally expensive, it’s more convenient to run them over a short summary that represents the (potentially much larger) amount of data.
A cryptographic hash function takes an arbitrary-length input and produces a fixed-length digest or "hash" which will serve as a summary of the data. Since the function maps arbitrarily large inputs to a fixed output range, collisions are guaranteed to exist. However, by making the size of the hash output sufficiently large, the collision probability can be made negligible.
Another important property of a hash function is that even a very small change in the input data, say a single bit flip, results in substantial changes in the output hash. On average, flipping a single bit of the input (even if the input is multiple terabytes) changes about 50% of the output bits of the hash. This means that small changes to the original data produce a substantially different digest or summary at the output of the hash function. Combined with the negligible probability of a collision, this will serve as our fingerprint of the data to be protected.
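As a quick illustration, here is a small sketch using Python's hashlib, with SHA-256 standing in for a generic cryptographic hash; the input string is arbitrary:

```python
import hashlib

data = bytearray(b"a large file or message to be fingerprinted")
digest1 = hashlib.sha256(data).digest()

# Flip a single bit of the input...
data[0] ^= 0x01
digest2 = hashlib.sha256(data).digest()

# ...and count how many of the 256 output bits changed.
diff_bits = sum(bin(a ^ b).count("1") for a, b in zip(digest1, digest2))
print(f"{diff_bits} of {len(digest1) * 8} output bits changed")  # typically around 128
```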
When trying to protect data, we often think of encryption (for a refresher, read our previous blogs). Encrypting data with a secret key can make that encrypted ciphertext unreadable by anyone who does not possess the key. So why don’t we just encrypt the data to protect it?
Encryption on its own only provides confidentiality and does not provide authenticity or integrity. Let’s examine one commonly used method of encryption: AES in counter mode. AES is a symmetric block cipher which uses a shared secret key to encrypt blocks of 16 bytes of data at a time. Block ciphers can operate in several different modes (methods of combining each block of data), and one commonly used mode is counter mode. In this mode, there is a monotonically increasing counter which is encrypted to produce a pseudorandom keystream. To encrypt data, this keystream is XORed with the plaintext input to generate the ciphertext.
The ciphertext can be stored or sent together with the initial counter value used (initialization vector) to later be decrypted with the same key. When decrypting, the initial counter value is loaded and again encrypted with the same key to produce the same keystream. This keystream is XORed with the ciphertext to cancel out the encryption operation and recover the plaintext.
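Here is a minimal sketch of that flow using the pyca/cryptography package; the key, the initial counter value, and the plaintext are made up for illustration, and in a real system the initial counter block must never be reused with the same key:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(16)               # shared secret key (AES-128)
initial_counter = os.urandom(16)   # initialization vector / initial counter block

plaintext = b"some plaintext data"

# Encrypt: AES turns the counter into a keystream, which is XORed with the plaintext.
encryptor = Cipher(algorithms.AES(key), modes.CTR(initial_counter)).encryptor()
ciphertext = encryptor.update(plaintext) + encryptor.finalize()

# Decrypt: the same key and initial counter regenerate the same keystream,
# which cancels out the encryption and recovers the plaintext.
decryptor = Cipher(algorithms.AES(key), modes.CTR(initial_counter)).decryptor()
assert decryptor.update(ciphertext) + decryptor.finalize() == plaintext
```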
Ciphertext is called malleable if modifications to the ciphertext also modify the underlying plaintext in a predictable way. Block ciphers in counter mode (as well as other stream ciphers) are trivially malleable. Flipping a single bit in the ciphertext will flip the same bit in the plaintext when it is decrypted, without changing any of the other bits. This means that an attacker can easily modify encrypted data and predict the effect of this change on the decrypted data, even when they do not know the key or the underlying plaintext message.
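Continuing with the same kind of sketch (a hypothetical message and byte offset, chosen only for illustration), the attacker flips one ciphertext bit and can predict exactly how the decrypted plaintext changes:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key, counter = os.urandom(16), os.urandom(16)
plaintext = b"transfer 100 credits"

encryptor = Cipher(algorithms.AES(key), modes.CTR(counter)).encryptor()
ciphertext = bytearray(encryptor.update(plaintext) + encryptor.finalize())

# The attacker flips one bit of the ciphertext without knowing the key...
ciphertext[9] ^= 0x01

decryptor = Cipher(algorithms.AES(key), modes.CTR(counter)).decryptor()
tampered = decryptor.update(bytes(ciphertext)) + decryptor.finalize()

# ...and exactly the same bit flips in the decrypted plaintext: '1' becomes '0'.
print(tampered)  # b'transfer 000 credits'
```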
Consider the case where two systems communicate wirelessly, sending and receiving messages encrypted with AES in counter mode. An adversary may be able to intercept the communications, and even though they may not be able to decrypt the messages, they know that flipping bits in the ciphertext flips the same bits in the plaintext. In some cases, it may be easy to guess what the plaintext might be: these types of machine-to-machine communications generally have a structured and predictable format, and the attacker may have acquired documentation for it, or may simply be able to guess or reverse engineer it.
In our example, an adversary would like to disrupt orders in a fulfillment center. The adversary knows the format of the messages exchanged (by reading the user manual): the command to process an order is "fill", and the command to delete an order is "kill". In the diagram above, we can see that the command to fill an order is correctly encrypted and decrypted. Let's examine how an adversary can alter this command undetected, even though the message is encrypted!
The adversary knows the location of the “fill” command in the plaintext message and wants to maliciously change the command to “kill” to cause the order to be cancelled, rather than fulfilled.
First, the adversary computes the difference ∆ between the message M = “fill” and M’ = “kill”. By XORing this difference ∆ with the ciphertext block containing the plaintext message “fill”, the character “f” is changed to “k” in the decrypted message!
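Here is a sketch of the whole attack in Python, again using the pyca/cryptography package; the exact message layout, key, and counter are made up for illustration, and note that the attacker's steps touch only the ciphertext:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def aes_ctr(key, counter, data):
    """Encrypt or decrypt with AES in counter mode (the operation is symmetric)."""
    ctx = Cipher(algorithms.AES(key), modes.CTR(counter)).encryptor()
    return ctx.update(data) + ctx.finalize()

key, counter = os.urandom(16), os.urandom(16)

# Sender encrypts the command.
plaintext = b"fill order #1234"
ciphertext = bytearray(aes_ctr(key, counter, plaintext))

# Attacker knows (or guesses) the message layout, but not the key or keystream.
# Delta between the command they expect and the command they want.
delta = bytes(a ^ b for a, b in zip(b"fill", b"kill"))

# XOR the delta into the ciphertext at the offset of the command.
for i, d in enumerate(delta):
    ciphertext[i] ^= d

# Receiver decrypts with the legitimate key and gets the attacker's command.
print(aes_ctr(key, counter, bytes(ciphertext)))  # b'kill order #1234'
```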
Since encryption alone doesn’t prevent this data from being modified, could we protect its integrity by encrypting the message along with its CRC? The attacker will not know the value of the CRC, and if they modify the ciphertext the CRC should detect the modification, right? As we have seen in the previous section, encryption can be malleable and unfortunately so can an encrypted CRC.
Suppose we have an encrypted message M which also contains an encrypted CRC over this message, and that we know the location in the ciphertext corresponding to the encrypted CRC. If we want to modify the message M to be a new message M’, we also need to modify the CRC.
Let ∆ = M ⊕ M' be the difference between the original message M and the new message M', that is, the two XORed together: each bit of ∆ is 1 exactly where a bit of the message has been flipped. The original CRC will not match the CRC computed over M', so we will need to modify the CRC so it is valid for the new message M'.
Since CRC is a linear function, CRC(M') = CRC(M) ⊕ CRC(∆). In other words, to get a new valid CRC, we compute the CRC over ∆ (the CRC of the bits we have changed) and XOR this value into the original CRC. And as we saw previously, we can do all of this directly on the ciphertext, without ever knowing the real message or the value of its CRC: we can change the message to whatever we want and still end up with a valid CRC!
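Here is a sketch of the full forgery in Python, using zlib's CRC-32 as a concrete CRC. One detail worth noting: standard CRC-32 applies an initial value and a final XOR, which makes it affine rather than strictly linear, so the identity picks up one extra term, the CRC of an all-zero message of the same length; the attacker can compute that constant and fold it in just as easily:

```python
import os, zlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def aes_ctr(key, counter, data):
    ctx = Cipher(algorithms.AES(key), modes.CTR(counter)).encryptor()
    return ctx.update(data) + ctx.finalize()

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

key, counter = os.urandom(16), os.urandom(16)

# Sender appends a CRC-32 to the message and encrypts message || CRC.
message = b"fill order #1234"
packet = message + zlib.crc32(message).to_bytes(4, "little")
ciphertext = bytearray(aes_ctr(key, counter, packet))

# Attacker: delta between the message they expect and the message they want.
delta = xor(b"fill", b"kill") + bytes(len(message) - 4)

# CRC correction: CRC(M') = CRC(M) ^ CRC(delta) ^ CRC(0...0) for the affine CRC-32.
crc_fix = zlib.crc32(delta) ^ zlib.crc32(bytes(len(message)))

# Apply both differences directly to the ciphertext, without knowing the key.
ciphertext[:len(message)] = xor(ciphertext[:len(message)], delta)
ciphertext[len(message):] = xor(ciphertext[len(message):], crc_fix.to_bytes(4, "little"))

# Receiver decrypts and the CRC still checks out.
forged = aes_ctr(key, counter, bytes(ciphertext))
body, crc = forged[:-4], int.from_bytes(forged[-4:], "little")
print(body, crc == zlib.crc32(body))  # b'kill order #1234' True
```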
This type of weakness is not just theoretical: it is present in the Wired Equivalent Privacy (WEP) algorithm, where it allows WEP traffic to be modified in a similar manner.
So encrypting data on its own doesn't protect it against modification, and neither does an encrypted CRC. If plaintext CRCs, encrypting the data, and encrypting the data along with its CRC all fail to protect our data against intentional modification, what does? In the next blog in the data integrity series, we will explore a couple of potential solutions.