This is a borderline topic. Since I wanted to know about programming, CPU cache memory, reading CPU cache lines etc, I'm posting it here.
I was implementing AES algorithm in C/C++. Since performing GF(28) multiplications are computationally expensive without hardware support, I've optimized to use lookup tables for the AES S-box. But unfortunately lookup table based implementations are vulnerable to cache-timing attacks. Therefore, being very naive to CPU caches, I started learning about how it works, and how to circumvent such an attack without any additional computational costs.
I realize that in practice there are AES NI instructions and Bit Slice AES implementations but they are way beyond my current understanding.
I've learned from crypto.SE that the easiest method is to read all cache lines, or read the entire table, before I do the lookup. (This also affects my performance) But I do not know how to implement it in software, or the intricate concepts behind it.
In the C implementation reference guide of OpenSSL - aes-x86core.c  the authors have implemented a function:
static void prefetch256(const void *table)
{
    volatile unsigned long *t=(void *)table,ret;
    unsigned long sum;
    int i;
    /* 32 is common least cache-line size */
    for (sum=0,i=0;i<256/sizeof(t[0]);i+=32/sizeof(t[0]))   sum ^= t[i];
    ret = sum;
}
for loop i is incremented by 8 if sizeof(t[0]) is 4. Since the AES sbox is an unsigned char array of 256 elements and when we call prefetch256(sbox) , the sbox pointer is cast to an unsigned long pointer, therefore every element dereferenced using t[i] is 4 bytes. But I don't understand the reasons why we skip 32 bytes, and not read it fully (to protect against cache timing attacks)? What is the motivation behind sum ^= t[i] and setting ret = sum ?
Are there other simpler solutions to protect my implementation from cache timing attacks? Will this following simpler code help me:
   unsigned char tmp[256];
   memcpy(tmp, sbox, 256); // memcpy reads the full sbox table quickly..  
Nevertheless, in an analysis of AES finalists done by Daemen and Rijmen, Rijndael was deemed a “favorable” can- didate to secure against timing attacks, since it did not use branch instructions or data-dependent rotations [DR99].
Cache timing attacks exploit timing differences between accessing cached vs. non-cached data. Since accessing cached data is faster, a program can check if its data is cached by measuring the time it takes to access it. In one form of a cache timing attack, the attacker fills the cache with its own data.
In cryptography, a timing attack is a side-channel attack in which the attacker attempts to compromise a cryptosystem by analyzing the time taken to execute cryptographic algorithms.
In the for loop i is incremented by 8 if sizeof(t[0]) is 4.
But I don't understand the reasons why we skip 32 bytes, and not read it fully (to protect against cache timing attacks)?
In the for loop, i is incremented by the equivalent of 32 chars (regardless of what sizeof(t[0]) happens to be) because "32 chars" (or 32 bytes) is what the authors determined is the minimum cache line size for all CPUs they care about. Note that you only need to read from 1 byte of the cache line to ensure that the entire cache line is fetched into cache.
What is the motivation behind sum ^= t[i] and setting ret = sum ?
A good compiler will notice that the data you're reading is not used, and will avoid doing an "unnecessary" (for correctness of the "C abstract machine") read to improve performance without knowing that the "unnecessary" read is necessary for reasons that the compiler can't be expected to know about. To prevent the compiler from optimizing like that you need to trick it into thinking the data is actually used, or use volatile. The OpenSSL authors are doing both (trying to trick the compiler into not optimizing by doing the sum ^= t[i] and ret sum; and also using volatile), possibly because (historically) plenty of old compilers had bugs involving volatile.
Also note that there's still a very tiny timing problem - cache could be flushed (e.g. by task switch, etc) after the data was prefetched but before part of the table is used for AES; so there's an (extremely tiny) chance that an attacker can still use a cache timing attack to determine which part of the table AES used. See "Confidence In Cache Timing Attack Prevention" (below).
Will this following simpler code help me:
It's very likely that the compiler will turn your code into literally nothing (and if it didn't, it'd have the same problems as prefetch256() and may be slower because you're writing to memory rather than only reading).
Are there other simpler solutions to protect my implementation from cache timing attacks?
Everything is a compromise between complexity, portability, security, features and performance; and "simpler" almost always means "worse portability" and/or "worse quality" and/or "worse features" and/or "worse performance". You can't make the quality or features worse while still protecting against cache timing attacks. You can't make performance worse because it's already as simple as it can be.
You might (or might not) be able to make it simpler by sacrificing portability. For example; if you know that the whole table fits in a single cache line on one computer (and is aligned to a cache line boundary), then you could do nothing for that one computer and say the code should never be used for any other computer.
Confidence In Cache Timing Attack Prevention
One of the key factors in guarding against cache timing attacks is how much control the attacker has. The typical scenario is that the attacker floods the cache (pollutes it with known data to cause its previous contents to become evicted due to "least recently used"), then lets the victim do something, then measures how quickly it can access its known data to determine if that known data is still in the cache or was evicted. If a cache line of known data was evicted the attacker knows that the victim accessed something that has the same location in the cache as the evicted cache line.
The worst possible case is that the attacker is able to do this extremely often (e.g. for every instruction the victim executes), the cache has no associativity (direct mapped), and the attacker either knows everything about how the victim uses virtual addresses and the relationship between the victim's virtual addresses and locations in the cache (likely including the relationship between virtual addresses and physical addresses) or is in the same process (e.g. shared library, where they can access the table themselves to determine if it was accessed instead of relying on the eviction of other data). In this case the only defense is to ensure that all memory access patterns are always the same (never dependent on any data). This is extremely hard, but not impossible - e.g. if you want to read one byte from a table (e.g. "byte = table[index]" where you don't want the attacker to know anything about index), you can read all preceding cache lines, then read the byte you want, then read all following cache lines (so that it always looks like a sequential read of the whole table) and do those accesses at a fixed rate (no "pause before reading the byte you want" and no "pause after reading the byte you want", including "no pause caused by branch mispredictions"). If you do this; then you can have extremely high confidence in your ability to guard against cache timing attacks (up to guaranteeing that your code is immune to all possible cache timing attacks).
However; it's almost impossible for the attacker to gain that level of control, extremely hard to write code like this, and code like that would have large performance overheads.
At the other extreme; you can do nothing and have no confidence in your ability to prevent cache timing attacks. Everything else falls between these extremes.
The question then is; what is a good compromise? This depends on too many factors - how much control the attacker has, if the attacker is in the same process, if the attacker can repeat the attack (and use a probabilistic approach to defeat any "noise"), how much the data is worth to the attacker (sane thieves don't spend more than $2 in an attempt to steal something that's worth $2 to the thief), how much the data is worth to the victim (nobody spends $100 per day to protect something that can be replaced for $2), what other mitigations are in place (e.g. most operating systems provide "address space layout randomization" now), etc.
For a good compromise; for your purposes, I'm personally fond of the "make it look like a sequential read of the whole table" approach, but a significantly simpler version that doesn't care very much about the fixed rate of accesses (or the "pause before/after reading the piece you want" and doesn't care about cache line sizes (accessing every byte won't cost much if the table is only 256 bytes, partly because most accesses will be "cache hit"). An example of this might be:
uint16_t fetch_byte_from_table(int index) {
    size_t table_entries = sizeof(table)/sizeof(table[0]);
    uint8_t dummy = 0;
    uint8_t data = 0;
    for(i = 0; i < table_entries; i++) {
        if(i == index) {
            data ^= table[i];
        } else {
            dummy ^= table[i];
        }
    }
    return ((uint16_t)dummy << 8) | data;        // Caller can ignore the higher 8 bits
}
Of course there are tricks you can use to try to hide or avoid the branches (e.g. data ^= table[i] * (i == index); dummy = data ^= table[i] * (i != index);), but they depend on compiler and target CPU.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With