Thanks for the replies y’all!
I’ll go through and try to understand them.
looking at assembly
@hellow: I don’t know much about assembly, so I looked up x86 architecture and instructions.
Here are my annotations for fellow n00bs, let me know if I got anything wrong.
In both cases, I assume
dil contains the argument and
rax will be interpreted as holding the return value.
xor ecx, ecx //clear register ecx (see https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and/33668295 for why it's xor)
cmp dil, 66 //compare dil with 66, which is ASCII 'B'.
sete cl //set the lowest byte of rcx to 1 if the comparison was equal, 0 otherwise.
add rcx, rcx //add rcx to itself: 0 stays 0 and 1 becomes 2 (the mapping we wanted from the match for 'B')
cmp dil, 65 //compare dil with 65, which is ASCII 'A'.
mov eax, 1 //move 1 into eax, the lower 32 bits of rax (writing eax also zeroes the upper 32 bits).
cmovne rax, rcx //if comparison was not equal (i.e., dil wasn't 'A'), move rcx into rax.
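If I'm reading it right, the assembly corresponds to a Rust match shaped roughly like this (my reconstruction, not necessarily the benchmark's actual source):

```rust
// Presumed shape of the 2-way match being compiled (names are my guesses).
fn match2(c: u8) -> u64 {
    match c {
        b'A' => 1, // 65
        b'B' => 2, // 66
        _ => 0,
    }
}

// Branch-free version mirroring the assembly above:
//   sete cl          -> (c == b'B') as u64
//   add rcx, rcx     -> doubled to 2
//   cmovne rax, rcx  -> keep 1 only when c == b'A'
fn match2_branchless(c: u8) -> u64 {
    let two_if_b = ((c == b'B') as u64) * 2;
    if c == b'A' { 1 } else { two_if_b }
}

fn main() {
    for c in 0u8..=255 {
        assert_eq!(match2(c), match2_branchless(c));
    }
    println!("{} {} {}", match2(b'A'), match2(b'B'), match2(b'X')); // 1 2 0
}
```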
The lookup is a bit more complicated:
movzx eax, dil // zero-extends 1-byte dil into 4-byte eax (upper bytes cleared).
lea rcx, // load effective address into rcx
[rip + .L__unnamed_1] // the address is computed relative to the instruction pointer: rip plus the offset to the lookup table
mov rax, // move into rax
qword ptr [rcx + 8*rax] // a pointer's worth of bytes (8) from the address given by `rcx + 8*rax`
.asciz "\000\000..." //a string with 8188 characters; each \000 is a byte, so there are 2047 total bytes here.
Chopping the .L__unnamed_1 lookup table data into chunks of 8 bytes, the 65th and 66th rows (counting from 0) are the values 1 and 2 in 64-bit little-endian.
Since ‘A’ is 65 and should map to 1, this must be how Rust has compiled the lookup table into, uh, assembly.
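So the lookup side presumably started life as something like this (again my reconstruction; the table layout is what the .asciz data implies):

```rust
// Presumed shape of the lookup: a [u64; 256] table indexed by the input byte.
// Rows 65 ('A') and 66 ('B') hold 1 and 2; everything else is 0.
fn build_table() -> [u64; 256] {
    let mut t = [0u64; 256];
    t[b'A' as usize] = 1;
    t[b'B' as usize] = 2;
    t
}

// mov rax, qword ptr [rcx + 8*rax]: each entry is 8 bytes,
// so the load address is table_base + 8 * index.
fn lookup(table: &[u64; 256], c: u8) -> u64 {
    table[c as usize]
}

fn main() {
    let t = build_table();
    println!("{} {} {}", lookup(&t, b'A'), lookup(&t, b'B'), lookup(&t, 0)); // 1 2 0
}
```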
A few things I don’t quite follow:
What’s the point of the
lea instruction? Why can’t the lookup just be
mov rax, qword ptr [.L__unnamed_1 + 8*rax]?
Putting that aside, why does
rip need to be involved at all? Isn’t
.L__unnamed_1 a fixed address?
Why is the lookup table data only 2047 bytes total? In Rust, it’s
[u64; 256], so shouldn’t that be
8 * 256 = 2048 bytes?
- Answer: the
.asciz pads the string with a final zero byte. It’d be 2048 bytes if the
.ascii directive had been used. At some point Rust or LLVM lost the metadata about what the lookup table was, because if it knew the entries were 8-byte values it probably would have used the
.quad directive. (Via conversation with a friend.)
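Sanity-checking the byte counts (my own arithmetic, not from the answer):

```rust
fn main() {
    // [u64; 256] is 8 * 256 = 2048 bytes.
    let table_bytes = 8 * 256;
    assert_eq!(table_bytes, 2048);

    // .asciz appends one implicit NUL, so the directive only needs to
    // spell out 2047 bytes explicitly.
    let explicit_bytes = table_bytes - 1;
    assert_eq!(explicit_bytes, 2047);

    // Each byte is written as a 4-character escape like \000,
    // which matches the 8188 characters in the string.
    assert_eq!(explicit_bytes * 4, 8188);
    println!("ok");
}
```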
This is just an assembly notation question: the non-zero bytes of the third lookup table (which goes up to 8) look like
What’s the deal with that last line?
I’d have expected it to be
\010 (if this were octal) or
\008 (if this were hex).
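Not an answer, but a sanity check on the escape arithmetic (my own, in Rust since that's the language at hand):

```rust
fn main() {
    // Octal 10 is decimal 8, so an octal escape for byte 8 would be \010.
    assert_eq!(0o10, 8);
    // A hex escape for byte 8 uses the digits "08"; in Rust that's \x08.
    assert_eq!(b'\x08', 8u8);
    // \008 can't be an octal escape, since 8 isn't a valid octal digit.
    assert!(u8::from_str_radix("8", 8).is_err());
    println!("ok");
}
```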
@farnbams: For the case of the alphabet size = 2, it’s not clear how lookup (which loads from a table) could ever be faster than match (which just does arithmetic without any jumps or memory reads).
I wouldn’t expect reading from cache to be faster than xor, cmp, add, and mov.
But lookup is faster for every “inlined mix” and the 4 and 8 sized “mix with rotate” tasks.
Presumably inlining lookup allows for further optimizations that aren’t possible with the match, but I’d need to inspect the generated assembly to test that assumption.
on branch prediction
@vitalyd: From what I can tell, there isn’t any branching going on in the match.
The match2 has no jump or anything.
The match4 has one jump:
add dil, -65
cmp dil, 3
movzx eax, dil
add rax, 1
xor eax, eax
but it only happens when the input data isn’t one of
'A', 'B', 'C', 'D', and that never happens in the synthetic generated data.
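For reference, here's the shape of match I assume produced that code, next to a version that spells out what the assembly is doing (my reconstruction):

```rust
// Presumed shape of the 4-way match (my reconstruction, not the actual source).
fn match4(c: u8) -> u64 {
    match c {
        b'A' => 1,
        b'B' => 2,
        b'C' => 3,
        b'D' => 4,
        _ => 0,
    }
}

// What the assembly does: subtract 65 ('A'), and if the result is in
// 0..=3 (the cmp dil, 3 plus the jump), return it plus 1; otherwise 0.
fn match4_like_asm(c: u8) -> u64 {
    let d = c.wrapping_sub(b'A');
    if d <= 3 { d as u64 + 1 } else { 0 }
}

fn main() {
    for c in 0u8..=255 {
        assert_eq!(match4(c), match4_like_asm(c));
    }
    println!("{} {}", match4(b'D'), match4(b'E')); // 4 0
}
```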
My understanding is that the real “penalty” of mispredicting a branch is loading the wrong stuff from memory, and it’s slow to load the right stuff.
But in this case, nothing is being loaded from memory.