Code speed optimization please

I made this pull request for the http package, so that it’d work in wasm; it works with --release, but not in debug mode.

I removed a whole lot of macro code and replaced it with a giant match statement; however it’s now 6% slower and I don’t know why.

Any direction appreciated: https://github.com/hyperium/http/pull/324

1 Like

Could do (chance compiler already does away with bound check so of no benefit; only test would tell.)

for i in 0..len {
    b[i] = table[data[i] as usize]
}

to

for (d,b) in data.iter().zip(b.iter_mut()) {
    *b = table[*d as usize]
}

The old code goes through fewer steps since broken by match over len and unwinds this loop for each length.

It’s sort of expected. The debug mode generates dumb and inefficient code on purpose. I’m not sure if it’s worth worrying about it, because without the optimizer there’s not much you can do. In debug mode abstractions are not zero-cost, so the only thing you can do is to remove abstractions and make code do less work.

When the application is too slow to use in the debug mode, the easy way out is to enable optimizations in your Cargo.toml:

[profile.dev]
opt-level=2
1 Like

My trouble is that compiling with --release makes my change -> iteration -> preview loop longer.

Thanks John, that gave me an extra 3 nano seconds :slight_smile:

An interesting thing is, if I order the inside of the match statement alphabetically, the ‘non-match’ takes twice as long to run.

I wouldn’t have expected the order of the inside of the match statement to affect performance. I expected the compiler to magically do its best and get the same result no matter what…

I also tried breaking up the whole match into match len { 2 => match &b[..len] { ... but it made no difference to performance.

I’ve not understood all of this, but I’m seeing this match statement on byte slices, and reminded me of a similar case I’ve optimized. You could have a look at this commit on how I used the byteorder crate here.

One thing there was pretty instructional for me, and pointed me to this optimization: My first version mainly contained a match statement on byte slices of length 8, all of which ended with a blank. I figured I could remove that blank, so the code has to read one byte less. But: performance was way worse! It was shown to me that this was the llvm recognized reading 8 bytes as reading a u64, but reading 7 bytes could not be optimized this way. So I employed byteorder to leverage that even further, and maybe that’s something that can help you, too.

I’m not sure why this is a problem. I’m also doing Rust/wasm32 developmnet. My setup is that I have 3+ crates:

  • app/ :: contains actual code
  • app-debug/ :: imports app, builds in debug mode via cargo web start
  • app-deploy/ :: imports app, builds in release mode, builds via ‘cargo build’, very rarely used, only used when I need to deploy to AWS

So you end up developing on debug, and deploying on release.

image

This should be using a precomputed HashMap to dispatch it: I recommend using ::phf and its precomputed perfect hash functions to do this:

use ::phf::{
    phf_map,
    Map,
};

fn parse_hdr<'bytes> (
    data: &'bytes [u8],
    buf: &'bytes [u8; 64],
    table: &'_ [u8; 256],
) -> Result<
        HdrName<'bytes>,
        InvalidHeaderName,
     >
{
    let len = data.len();

    assert!(
        len < MAX_HEADER_NAME_LEN,
        "header name too long: max length is {}",
        MAX_HEADER_NAME_LEN,
    );

    Ok(match len {
        | 0 => {
            return Err(InvalidHeaderName::default());
        },
        | len if len > 64 => {
            HdrName::custom(data, false)
        },
        | _ => {
            use self::StandardHeader::*;
            static STANDARD_HEADERS: Map<&[u8], StandardHeader> = phf_map! {
                b"te"  => Te,
                b"age" => Age,
                b"via" => Via,
                b"dnt" => Dnt,
                // ...
            };
            let buf = &buf[.. len];
            match STANDARD_HEADERS.get(buf) {
                | Some(&standard_header) => {
                    standard_header.into()
                },
                | None => {
                    if buf.contains(&b'\0') {
                        return Err(InvalidHeaderName::default());
                    } else {
                        HdrName::custom(buf, true)
                    }
                },
            }
        },
    })
}
  • The currently released version, 0.7.24,
    requires the nightly #![feature(proc_macro_hygiene)],
    but the repository is currently working on the next 0.8.0 release,
    which fixes it so that the macro can be used in stable rust.

    • you can thus, in the meantime, work with it in stable rust by using the following Cargo.toml:

      [dependencies]
      phf = { version = "0.7.24", features = ["macros"] }
      
      [patch.crates-io]
      phf = { git = "https://github.com/sfackler/rust-phf" }
      
      • Here is a gist to test it.

I don’t have a ton of experience with this kind of problem, but the aho-corasick crate generates a finite state machine that I would think would work better than a simple match statement.

Edit:

After reading some more it sounds like the crate I suggested does more than you need and that a precomputed hash table as suggested may be the best bet. You could also try a BTreeMap or a radix trie if you want some other ideas of things to try.

https://github.com/BurntSushi/aho-corasick/blob/master/README.md

Does this mean the compiler is not smart enough to transform a match statement into hashed lookups on its own? Why can’t this (or a similar optimization) be done automatically?

The compiler does not transform a match statement into hashed lookups, because:

  • match statements are one of the core building blocks of thr language; more high-level abstractions should be built on top of it, not within it;

  • and mainly because it would go against Rust’s explicitness, leading, for instance, to zero-cost abstractions ceasing to be such:

    • Imagine the match statement did perform implicit hash-table lookups.

      Then imagine deciding to create a supposedly zero-cost type-level abstraction, such as:

      #[derive(Clone, Copy)]
      struct HeaderBytes<'a> /* = */ (
          &'a[u8],
      );
      
      let buf = HeaderBytes(&buf[.. len]);
      
      match buf {
          | HeaderBytes("te") => ...,
          ...
      }
      

      And then suddenly we are back to “slow” / classic match statement because our newtype is not hashable, and all this, again, happening implicitly.

  • another reason would be that hash-table lookup only becomes more efficient after a certain point, very hard to guess exactly by a “smart” compiler. This would mean that in some cases the compiler would choose a construct slower than the optimal one, while the programmer is oblivious to all this because of it being implicit.

On the other hand, this is a classical situation where a macro can provide this behavior: using a macro would make the hash-based nature of the lookup no longer implicit, thus, for instance, allowing it to make compilation fail with the above newtype example (complaining against the lack of a Hash implementation on the newtype).

2 Likes