Auto-vectorize into VCOMPRESS

I want to get Rust code to auto-vectorize to VCOMPRESSD. I've tried the algorithm described in the article:

```
(KL, VL) = (16, 128), (32, 256), (64, 512)
k := 0
FOR j := 0 TO KL-1:
    IF k1[j] OR *no writemask*:
        DEST.byte[k] := SRC.byte[j]
        k := k + 1
```

I've tried unrolls and some other tricks, but it doesn't seem like I can get Godbolt to output a VCOMPRESS instruction.
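For reference, here is a straightforward scalar transcription of that pseudocode in Rust, roughly what I've been feeding the compiler (a simplified sketch, not my exact Godbolt code):

```rust
/// Scalar compress, mirroring the pseudocode above: copy src[j] into the
/// next free output slot whenever the corresponding mask element is set,
/// and return how many elements were written.
pub fn compress(src: &[u8; 64], mask: &[bool; 64], dest: &mut [u8; 64]) -> usize {
    let mut k = 0;
    for j in 0..64 {
        if mask[j] {
            dest[k] = src[j];
            k += 1;
        }
    }
    k
}
```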

Old Godbolt playground


I set -C opt-level=3 -C target-feature=+avx512f, so the compiler should have emitted the instruction if it could recognize the pattern.

Does anyone have any tips or tricks? I could have sworn I got it to vectorize once, albeit with a different method, but I can't recollect which one.

EDIT: Updated based on SkiFire13's notes.

It doesn't seem to be supported by LLVM if this open issue is still up to date.

Besides that:

-C target-feature=native is not a valid flag (the compiler even throws a warning at you!). The correct one would be -C target-cpu=native. But still, don't use -C target-cpu=native on Godbolt: it will produce results that depend on exactly which machine Godbolt happens to run on. This is particularly important for you, since AVX-512 is not supported by all CPUs, so you might hit a machine that doesn't support it and get different results semi-randomly. Prefer instead to enable exactly the features you care about, e.g. with -C target-feature=+avx512f,+avx512vl
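As an aside, if the concern is which machine the code ends up running on, runtime feature detection is the usual alternative to compiling for one specific CPU. A minimal sketch (scan_avx512 and scan_portable are placeholder names, not anything from your code):

```rust
/// Dispatch at runtime on the machine actually executing the code, instead
/// of baking in the build machine's features with -C target-cpu=native.
fn scan(input: &[u8]) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            // SAFETY: we just verified AVX-512F is available on this CPU.
            return unsafe { scan_avx512(input) };
        }
    }
    scan_portable(input)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn scan_avx512(input: &[u8]) -> usize {
    // placeholder: the AVX-512 path would go here
    input.len()
}

fn scan_portable(input: &[u8]) -> usize {
    // placeholder: the portable fallback would go here
    input.len()
}
```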


Your code also has a typo:

```rust
let x4 = k1.get_unchecked(4);
if k1[1] {
```

The second line should be if k1[4] {.

I'm also not sure why you're using unchecked indexing into k1 just to use checked indexing for the same value right after.
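i.e., pick one style and reuse the value you already read, something like (assuming k1 is a slice of bools, as in your snippet):

```rust
// Checked indexing once, then reuse the value; LLVM will usually
// elide the bounds check here anyway.
let x4 = k1[4];
if x4 {
    // ...
}
```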


I fear this is the correct answer, even if it's not an actual solution. I'll keep the topic open for a while to see if anyone else has a non-obvious "hack".

Hopefully this has been addressed in the new Godbolt link.

Is there a reason you're looking for a hack instead of just using the intrinsic you need?
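For the dword case that's essentially a one-liner around VPCOMPRESSD; a sketch (depending on your Rust version, the AVX-512 intrinsics may require a recent toolchain or nightly):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// VPCOMPRESSD directly: pack the dword lanes of `src` selected by `mask`
/// into the low lanes of the result, zeroing the remaining lanes.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn compress_dwords(src: __m512i, mask: __mmask16) -> __m512i {
    _mm512_maskz_compress_epi32(mask, src)
}
```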

I want to write a SIMD-friendly parser on stable Rust, so no portable SIMD (and I'm keeping dependencies as light as possible).

To achieve this I have implemented the parser as a trait (e.g. Stage1Scanner) that's implemented per architecture/feature (e.g. Avx512Scanner, NeonScanner, etc.). I do this because different architectures have different optimal lane sizes and registers.

However, I need a fallback that's generally as good as it gets: ideally something that auto-vectorizes to near-optimal code, so I can keep more in the default trait implementation. The less platform-specific code there is, the less I need to write.
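Roughly, the shape is something like this (a heavily simplified sketch of the real trait):

```rust
/// Per-architecture scanner. Backends like Avx512Scanner override the hot
/// methods with intrinsics; whatever stays in the default body is the
/// portable fallback I'd like LLVM to auto-vectorize well.
trait Stage1Scanner {
    /// Default: the scalar compress loop from the first post.
    fn compress(&self, src: &[u8], mask: &[bool], dest: &mut [u8]) -> usize {
        let mut k = 0;
        for (&byte, &keep) in src.iter().zip(mask) {
            if keep {
                dest[k] = byte;
                k += 1;
            }
        }
        k
    }
}

struct Avx512Scanner;
struct NeonScanner;

impl Stage1Scanner for Avx512Scanner {
    // would override `compress` with the VPCOMPRESS intrinsics here
}

impl Stage1Scanner for NeonScanner {}
```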