I thought, by way of a quick challenge, I would convert some nested C loops to Rust. Here is the C:
for (int i = 0; i < MAX; i += BLOCK_SIZE) {
    for (int j = 0; j < MAX; j += BLOCK_SIZE) {
        for (int ii = i; ii < i + BLOCK_SIZE; ii++) {
            for (int jj = j; jj < j + BLOCK_SIZE; jj++) {
                a[ii][jj] += b[jj][ii];
            }
        }
    }
}
After a long conversation with clippy I ended up with this in Rust:
fn do_it_3(a: &mut T, b: &T) {
    for s in 0..BLOCKS {
        for t in 0..BLOCKS {
            a.iter_mut()
                .skip(s * BLOCK_SIZE)
                .take(BLOCK_SIZE)
                .enumerate()
                .for_each(|(i, row)| {
                    row.iter_mut()
                        .skip(t * BLOCK_SIZE)
                        .take(BLOCK_SIZE)
                        .enumerate()
                        .for_each(|(j, y)| {
                            *y += b[j + t * BLOCK_SIZE][i + s * BLOCK_SIZE];
                        });
                });
        }
    }
}
I'm not happy with that. It's long-winded and hard to read. It's also slow.
What actually is this?
Recently someone linked to the Intel article on writing cache-friendly code: Intel Developer Zone. No big news for anyone nowadays, I guess. It includes the simple C example above, which adds the elements of two large arrays.
But I started to wonder how one would do that in Rust.
The closest I have come to writing something reasonable-looking in Rust that runs almost as fast as the C is this:
for i in 0..(MAX / BLOCK_SIZE) {
    for j in 0..(MAX / BLOCK_SIZE) {
        for ii in (i * BLOCK_SIZE)..((i * BLOCK_SIZE) + BLOCK_SIZE) {
            for jj in (j * BLOCK_SIZE)..((j * BLOCK_SIZE) + BLOCK_SIZE) {
                a[ii][jj] += b[jj][ii];
            }
        }
    }
}
Of course clippy does not like this at all. But it performs almost like C:
C   : 44361 microseconds
Rust: 49241 microseconds
Perhaps I'm missing a trick here. Any suggestions?
Code is in the playground: Rust Playground