Rust-crypto: AES-CBC producing weird benchmarks

Hello everyone, I am currently working on a project that is going to become the foundation for some bigger research on Rust's crypto landscape and I am currently facing an issue with Rust-crypto's AES implementation when using CBC as block mode.

I wrote benchmarks using criterion and its cycles-per-byte plugin for ECB, CBC, CTR and GCM to measure how they perform in terms of Cycles-per-Byte and I am getting very reasonable results for ECB, CTR and GCM. In fact the results for these three modes are exactly as they are supposed to be.

CBC however does not produce expected results and I do not understand how or why.

Here's the premise: All benchmarks are executed on an I7-8700k with 3,7 GHz core frequency and disabled turbo boost. According to Intels specifications this CPU uses an AES pipeline length of 4.
The correct amount of Cycles-per-Byte should therefore be 2.5cpb for CBC-128 encryption. This I was able to verify with a basic benchmark in C.

Now the problem: My CBC-128 benchmarks in Rust produce a result of 4.5cpb, so 2cbp above the "normal".

Could somebody take a look at my code?
Here's the content of aes-cbc.rs:

use aes::cipher::{BlockEncryptMut, KeyIvInit};
use aes::{Aes128, Aes192, Aes256};
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use criterion_cycles_per_byte::CyclesPerByte;
use RustCrypto_AES_Benchmarks as benches;

type Aes128CbcEnc = cbc::Encryptor<Aes128>;
type Aes192CbcEnc = cbc::Encryptor<Aes192>;
type Aes256CbcEnc = cbc::Encryptor<Aes256>;

pub const KB: usize = 1024;

fn bench(c: &mut Criterion<CyclesPerByte>) {
    let mut group = c.benchmark_group("aes-cbc");

    let mut cipher128 = Aes128CbcEnc::new(&Default::default(), &Default::default());
    let mut cipher192 = Aes192CbcEnc::new(&Default::default(), &Default::default());
    let mut cipher256 = Aes256CbcEnc::new(&Default::default(), &Default::default());
    for size in &[KB, 2 * KB, 4 * KB, 8 * KB, 16 * KB] {
        let mut buf = vec![Default::default(); *size / 16];

        group.throughput(Throughput::Bytes(*size as u64));

        group.bench_function(BenchmarkId::new("encrypt-128", size), |b| {
            b.iter(|| cipher128.encrypt_blocks_mut(&mut buf));
        });

        group.bench_function(BenchmarkId::new("encrypt-192", size), |b| {
            b.iter(|| cipher192.encrypt_blocks_mut(&mut buf));
        });

        group.bench_function(BenchmarkId::new("encrypt-256", size), |b| {
            b.iter(|| cipher256.encrypt_blocks_mut(&mut buf));
        });
    }

    group.finish();
}

criterion_group!(
    name = benches;
    config = Criterion::default().with_measurement(CyclesPerByte);
    targets = bench
);

criterion_main!(benches);

The entire project can be found here:

CBC encryption can not be parallelized, so it's expected that it will be significantly slower than parallelizable modes like ECB, CTR or CBC decryption (GCM uses CTR under the hood). It happens because data dependencies cause CPU pipeline to stall, thus reducing measured performance. Are you sure that your C benchmark is correct?

Have you tried to measure performance with RUSTFLAGS="-C target-feature+aes"? It should allow compiler to remove CPU feature detection branches and simplify necessary optimizations.

Either way, Rust compile is not ideal and it's not uncommon for it to produce non-ideal code. The abstract-heavy approach which RustCrypto has to use does not help with it either. It could be worth to inspect generated assembly to see if there are some glaring inefficiencies in it.

I am fairly sure that the C benchmark is correct. It has been verified by my supervisor but if you want to re-confirm it, this is the C code: (It's not pretty but it gets the job done)

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <string.h>
#include <inttypes.h>
#include <wmmintrin.h>
#include <emmintrin.h>
#include <assert.h>
#include <smmintrin.h>

#ifndef REPEAT
	//#define REPEAT 5000
	// Set to 10000, 100000 or 1000000
	#define REPEAT 1000000
#endif
#ifndef WARMUP
	#define WARMUP REPEAT/4
#endif
	uint64_t start_clk,end_clk;
	double total_clk;
__inline uint64_t get_Clks(void) {
	uint32_t __a,__d;
	__asm__ __volatile__ ("rdtsc" : "=a" (__a), "=d" (__d));
	return ((uint64_t)__a) | (((uint64_t)__d)<<32ULL);
}
#define MEASURE(x)  for (int i=0; i< WARMUP; i++)		   \
								 {x;}				   \
					start_clk=get_Clks();			   \
					for (int i = 0; i < REPEAT; i++)		\
					{								   \
								 {x;}				   \
					}								   \
					end_clk=get_Clks();				 \
					total_clk=(double)(end_clk-start_clk)/REPEAT;

#define TIME_IT(name, func, nbytes, MULTIPLE) \
	printf("%s-%d: ", name, nbytes); \
	MEASURE(func); \
	printf("%g cpb\n", total_clk/(nbytes)/(MULTIPLE));

static __m128i assist128(__m128i a, __m128i b)
{
    __m128i tmp = _mm_slli_si128 (a, 0x04);
    a = _mm_xor_si128 (a, tmp);
    tmp = _mm_slli_si128 (tmp, 0x04);
    a = _mm_xor_si128 (_mm_xor_si128 (a, tmp), _mm_slli_si128 (tmp, 0x04));
    return _mm_xor_si128 (a, _mm_shuffle_epi32 (b ,0xff));
}

static void AES_set_encrypt_key(const unsigned char *userKey, __m128i* sched)
{
    sched[ 0] = _mm_loadu_si128((__m128i*)userKey);
    sched[ 1] = assist128(sched[0], _mm_aeskeygenassist_si128(sched[0],0x1));
    sched[ 2] = assist128(sched[1], _mm_aeskeygenassist_si128(sched[1],0x2));
    sched[ 3] = assist128(sched[2], _mm_aeskeygenassist_si128(sched[2],0x4));
    sched[ 4] = assist128(sched[3], _mm_aeskeygenassist_si128(sched[3],0x8));
    sched[ 5] = assist128(sched[4], _mm_aeskeygenassist_si128(sched[4],0x10));
    sched[ 6] = assist128(sched[5], _mm_aeskeygenassist_si128(sched[5],0x20));
    sched[ 7] = assist128(sched[6], _mm_aeskeygenassist_si128(sched[6],0x40));
    sched[ 8] = assist128(sched[7], _mm_aeskeygenassist_si128(sched[7],0x80));
    sched[ 9] = assist128(sched[8], _mm_aeskeygenassist_si128(sched[8],0x1b));
    sched[10] = assist128(sched[9], _mm_aeskeygenassist_si128(sched[9],0x36));
}

void ae_encrypt			(__m128i* expkey,
						const void	*nonce,
						const void	*pt,
						int 		pt_len,
						void		*ct)
{
	__m128i lastblock = _mm_loadu_si128((__m128i*) nonce);
	__m128i block;
	unsigned long length =  pt_len / 16;
	for (int i = 0; i < length; i++) {
		block = _mm_loadu_si128(&((__m128i*)pt)[i]) ^ lastblock ^ expkey[0];
		
		for(int j=1; j <10; j++)
			block = _mm_aesenc_si128 (block,expkey[j]);
			
		lastblock = _mm_aesenclast_si128 (block,expkey[10]);
		_mm_storeu_si128 (&((__m128i*)ct)[i], lastblock);
	}
}

int main(void)
{
	unsigned char m[16384] = "";
	unsigned char c[16384];
	unsigned char key[16] = { 0x67,0xc6,0x69,0x73,0x51,0xff,0x4a,0xec,0x29,0xcd,0xba,0xab,0xf2,0xfb,0xe3,0x46, };
	unsigned char nonce[16] = {0};

	__m128i expkey[11];

	AES_set_encrypt_key(key, expkey);

	srand(time(NULL));
	for (unsigned i = 0; i < sizeof m/4; i++) {
		((uint32_t*) m)[i] = rand();
	}

	for (int nblocks = 32; nblocks <= 1024; nblocks*=2) {
		int nbytes = nblocks*16;

		TIME_IT("AES-CBC encryption", ae_encrypt(expkey, nonce, m, nbytes, c), nbytes, 1)
	}
	return 0;
}

I have tried measuring with manual providing of RUSTFLAGS, though it did not have any impact, so that can also not be it.

The thing that bugs me is that ECB and CTR perform just as expected with only CBC being that off, it's weird.
I'll take a look at the generated assembly and hopefully this will shed some light on it.

Ok so I took a look at the generated assembly. Now I am not an expert on reading assembly but I quickly came to realize, that the assembly generated by my benchmarks for AES-ECB, CTR and GCM definitely make use of pipelining, while I am failry certain that AES-CBC does not make use of any pipelining whatsoever. May I ask for your assistance in this matter?

Have you compared it with C assembly? What do you mean by pipelining and what do you expect to see? If I understand you correctly, AES-CBC encryption can not be pipelined since it has tight data dependency, i.e. most of computation is performed in a single dataflow line.

Oh nevermind I interpreted the results wrong. As I said I don't know much about assembly but after spending some time with it and comparing the results to the CBC encryption in C it all cleared up.
However I still can't say whether there are "glaring inefficiencies" inside, as you phrased it. May I share the generated assembly code with you?

I took a look at generated assembly and it looks like that for some reason Rust compiler generates code which stores each IV and block byte in it's own register and XORs them separately byte-by-byte, it then assembles those bytes into an XMM register, encrypts it using AES-NI, and then splits result back into separate registers. Ideally block and IV should stay inside XMM registers, which is clearly not the case here.

I will try play with crates in the following days to see if it's possible to help compiler to generate better code.

I see, thank you very much this clears things up a lot!

Try to compile code with the latest nightly. It results in a much better codegen on my test case. Though built-in benchmark results are somewhat weird, so it looks like codegen quality is not so stable.

Also this PR should help with codegen as well.

1 Like

Updated to latest nightly and pulled the updated code.
It's looking MUCH better now with cbc-128 only being 0.3 cpb behind the C implementation in all cases, while it was a difference of 2.2 cpb before.

Thank you for your time, much appreciated!

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.