Code speed optimization please

matiu2 · July 11, 2019, 2:39am

I made this pull request for the http package, so that it'd work in wasm; it works with --release, but not in debug mode.

I removed a whole lot of macro code and replaced it with a giant match statement; however it's now 6% slower and I don't know why.

Any direction appreciated: Refactored header name matching for WASM target by matiu2 · Pull Request #324 · hyperium/http · GitHub

jonh · July 11, 2019, 10:06am

Could do (chance compiler already does away with bound check so of no benefit; only test would tell.)

for i in 0..len {
    b[i] = table[data[i] as usize]
}

to

for (d,b) in data.iter().zip(b.iter_mut()) {
    *b = table[*d as usize]
}

The old code goes through fewer steps since broken by match over len and unwinds this loop for each length.

kornel · July 11, 2019, 10:08am

It's sort of expected. The debug mode generates dumb and inefficient code on purpose. I'm not sure if it's worth worrying about it, because without the optimizer there's not much you can do. In debug mode abstractions are not zero-cost, so the only thing you can do is to remove abstractions and make code do less work.

When the application is too slow to use in the debug mode, the easy way out is to enable optimizations in your Cargo.toml:

[profile.dev]
opt-level=2

matiu2 · July 12, 2019, 4:25am

My trouble is that compiling with --release makes my change -> iteration -> preview loop longer.

matiu2 · July 12, 2019, 4:25am

Thanks John, that gave me an extra 3 nano seconds

matiu2 · July 12, 2019, 4:34am

An interesting thing is, if I order the inside of the match statement alphabetically, the 'non-match' takes twice as long to run.

I wouldn't have expected the order of the inside of the match statement to affect performance. I expected the compiler to magically do its best and get the same result no matter what..

matiu2 · July 12, 2019, 4:54am

I also tried breaking up the whole match into match len { 2 => match &b[..len] { ... but it made no difference to performance.

KillTheMule · July 12, 2019, 7:03am

I've not understood all of this, but I'm seeing this match statement on byte slices, and reminded me of a similar case I've optimized. You could have a look at this commit on how I used the byteorder crate here.

One thing there was pretty instructional for me, and pointed me to this optimization: My first version mainly contained a match statement on byte slices of length 8, all of which ended with a blank. I figured I could remove that blank, so the code has to read one byte less. But: performance was way worse! It was shown to me that this was the llvm recognized reading 8 bytes as reading a u64, but reading 7 bytes could not be optimized this way. So I employed byteorder to leverage that even further, and maybe that's something that can help you, too.

zeroexcuses · July 12, 2019, 8:04am

I'm not sure why this is a problem. I'm also doing Rust/wasm32 developmnet. My setup is that I have 3+ crates:

app/ :: contains actual code
app-debug/ :: imports app, builds in debug mode via cargo web start
app-deploy/ :: imports app, builds in release mode, builds via 'cargo build', very rarely used, only used when I need to deploy to AWS

So you end up developing on debug, and deploying on release.

Yandros · July 12, 2019, 10:10am

This should be using a precomputed HashMap to dispatch it: I recommend using ::phf and its precomputed perfect hash functions to do this:

use ::phf::{
    phf_map,
    Map,
};

fn parse_hdr<'bytes> (
    data: &'bytes [u8],
    buf: &'bytes [u8; 64],
    table: &'_ [u8; 256],
) -> Result<
        HdrName<'bytes>,
        InvalidHeaderName,
     >
{
    let len = data.len();

    assert!(
        len < MAX_HEADER_NAME_LEN,
        "header name too long: max length is {}",
        MAX_HEADER_NAME_LEN,
    );

    Ok(match len {
        | 0 => {
            return Err(InvalidHeaderName::default());
        },
        | len if len > 64 => {
            HdrName::custom(data, false)
        },
        | _ => {
            use self::StandardHeader::*;
            static STANDARD_HEADERS: Map<&[u8], StandardHeader> = phf_map! {
                b"te"  => Te,
                b"age" => Age,
                b"via" => Via,
                b"dnt" => Dnt,
                // ...
            };
            let buf = &buf[.. len];
            match STANDARD_HEADERS.get(buf) {
                | Some(&standard_header) => {
                    standard_header.into()
                },
                | None => {
                    if buf.contains(&b'\0') {
                        return Err(InvalidHeaderName::default());
                    } else {
                        HdrName::custom(buf, true)
                    }
                },
            }
        },
    })
}

The currently released version, 0.7.24,
requires the nightly #![feature(proc_macro_hygiene)],
but the repository is currently working on the next 0.8.0 release,
which fixes it so that the macro can be used in stable rust.
- you can thus, in the meantime, work with it in stable rust by using the following Cargo.toml:
```
[dependencies]
phf = { version = "0.7.24", features = ["macros"] }

[patch.crates-io]
phf = { git = "https://github.com/sfackler/rust-phf" }
```
  - Here is a gist to test it.

drewkett · July 12, 2019, 2:29pm

I don’t have a ton of experience with this kind of problem, but the aho-corasick crate generates a finite state machine that I would think would work better than a simple match statement.

Edit:

After reading some more it sounds like the crate I suggested does more than you need and that a precomputed hash table as suggested may be the best bet. You could also try a BTreeMap or a radix trie if you want some other ideas of things to try.

https://github.com/BurntSushi/aho-corasick/blob/master/README.md

johannesvollmer · July 17, 2019, 9:50am

Does this mean the compiler is not smart enough to transform a match statement into hashed lookups on its own? Why can't this (or a similar optimization) be done automatically?

Yandros · July 17, 2019, 10:38am

The compiler does not transform a match statement into hashed lookups, because:

match statements are one of the core building blocks of thr language; more high-level abstractions should be built on top of it, not within it;
and mainly because it would go against Rust's explicitness, leading, for instance, to zero-cost abstractions ceasing to be such:
- Imagine the match statement did perform implicit hash-table lookups.
  
  Then imagine deciding to create a supposedly zero-cost type-level abstraction, such as:
```
#[derive(Clone, Copy)]
struct HeaderBytes<'a> /* = */ (
    &'a[u8],
);

let buf = HeaderBytes(&buf[.. len]);

match buf {
    | HeaderBytes("te") => ...,
    ...
}
```
  And then suddenly we are back to "slow" / classic match statement because our newtype is not hashable, and all this, again, happening implicitly.
another reason would be that hash-table lookup only becomes more efficient after a certain point, very hard to guess exactly by a "smart" compiler. This would mean that in some cases the compiler would choose a construct slower than the optimal one, while the programmer is oblivious to all this because of it being implicit.

On the other hand, this is a classical situation where a macro can provide this behavior: using a macro would make the hash-based nature of the lookup no longer implicit, thus, for instance, allowing it to make compilation fail with the above newtype example (complaining against the lack of a Hash implementation on the newtype).

system · October 15, 2019, 10:40am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Understanding slow link times help	0	107	April 5, 2024
SIMD help with code slower in Release mode help	7	1176	November 27, 2019
Can I replicate the compiler's optimizations in debug mode help	6	597	July 12, 2019
WASM faster than native x86-64 build. WT..?!	39	3832	November 11, 2019
How to Optimize For Code Size?	3	973	December 14, 2021

Code speed optimization please

Related Topics