Code review request: why is this Rust code 3 times slower than Go?

What is fundamentally wrong with this code if it runs 3 times slower than similar Go code? What am I doing wrong? Yes, I ran it in release mode. Yes, it could be one line of code using collection traits. I just want to know where the bottleneck is.

struct Word {
    text: String,
}

fn get_words(v: &Vec<u8>) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    let mut lh: usize = 0;

    for i in 0..v.len() {
        if v[i].is_ascii_whitespace() {
            if i > lh {
                let s = String::from_utf8_lossy(&v[lh..i]).to_string();
                words.push(Word { text: s });
            }
            lh = i + 1;
        }
    }

    words
}

Here is its Go counterpart:

type Word struct {
	Text string
}

func getWords(bs []byte) []Word {
	var (
		words []Word
		lh    int
	)

	for i := 0; i < len(bs); i++ {
		b := bs[i]
		if b == ' ' || b == '\r' || b == '\n' {
			if i > lh {
				words = append(words, Word{
					Text: string(bs[lh:i]),
				})
			}
			lh = i + 1
		}
	}

	return words
}

2 Likes

I haven’t tried it out, but a Go string is just bytes, right? That would make the equivalent Rust code use

struct Word {
    text: Vec<u8>,
}

instead, with something like Vec::from(&v[lh..i]) in place of from_utf8_lossy. The latter has extra overhead because it checks for valid UTF-8.

Strings everywhere are just bytes. And I do the safe (meaning, not the fastest) conversion from bytes to string in my Go version. But even after updating the Rust code to make the text field a Vec<u8> and using Vec::from(&v[lh..i]), the result is the same -- Go is 3 times faster.

Here you're constructing a String from a byte slice using String::from_utf8_lossy, duplicating its contents with .to_string(), and immediately dropping the original by not using it. Fixing that may give a huge boost, but I wouldn't expect it to beat the Go version, since:

  1. Go doesn't care whether its string type contains invalid UTF-8 sequences, so it doesn't check. String::from_utf8_lossy checks and performs a lossy conversion.

  2. For small programs of this size it's common for the garbage collector not to run even once. By not running, it skips not only the GC tracking cost but also the heap deallocation cost entirely. Rust doesn't have a GC, and it frees each allocation as soon as it is no longer used.

7 Likes

from_utf8_lossy() returns a Cow, so it's not quite that bad, but it should probably be into_owned() anyway.
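
A minimal sketch of that change, as a drop-in for the push in the loop above:

// from_utf8_lossy returns Cow<str>: Borrowed when the bytes are valid UTF-8,
// Owned when replacement characters had to be inserted. into_owned() moves the
// Owned case instead of copying it again the way .to_string() does.
let s = String::from_utf8_lossy(&v[lh..i]).into_owned();
words.push(Word { text: s });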

I changed my Rust to this:

struct Word {
    text: Vec<u8>,
}

fn get_words(v: &Vec<u8>) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    let mut lh: usize = 0;

    for i in 0..v.len() {
        if v[i].is_ascii_whitespace() {
            if i > lh {
                let s = Vec::from(&v[lh..i]);
                words.push(Word { text: s });
            }
            lh = i + 1;
        }
    }

    words
}

So, it's pure bytes now, no checking for invalid UTF-8.
I wouldn't post this topic if the results were close: a few percent give or take. But this is a threefold difference. Not calling the GC vs. not having a GC should at least come out equal.

Idiomatically, I recommend changing the argument to v: &[u8] and the loop to for (i, b) in v.iter().enumerate() {.

Both changes are potential performance improvements, although I don't expect them to make up the 3x difference. Can you share the calling code or the data passed in, so the performance difference is reproducible?
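
A minimal sketch combining both suggestions (assuming the Vec<u8>-based Word from your updated version):

struct Word {
    text: Vec<u8>,
}

fn get_words(v: &[u8]) -> Vec<Word> {
    let mut words = Vec::new();
    let mut lh = 0;

    // enumerate() yields (index, &byte) pairs, avoiding the bounds check
    // that v[i] performs on every iteration.
    for (i, b) in v.iter().enumerate() {
        if b.is_ascii_whitespace() {
            if i > lh {
                words.push(Word { text: v[lh..i].to_vec() });
            }
            lh = i + 1;
        }
    }

    words
}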

2 Likes

fn main() {
    let v = std::fs::read("story.txt").unwrap();

    let sys_time = std::time::SystemTime::now();
    let mut count = 0;
    loop {
        let _words = get_words(&v);

        count += 1;
        if count == 100_000 {
            break;
        }
    }

    println!("{}: {:?}", count, sys_time.elapsed().unwrap());
}

vs

func main() {
	bs, _ := os.ReadFile("story.txt")
	count := 0
	start := time.Now()

	for {
		_ = getWords(bs)
		count++
		if count == 100_000 {
			break
		}
	}

	fmt.Println(count, time.Since(start))
}

And I used this for my text file: https://doc.rust-lang.org/book/foreword.html )))

As I mentioned, it is not. To use C terms for familiarity: by not having a GC, you pay the cost of malloc/free. By having a GC, you normally pay for malloc/free plus the additional cost of the GC tracking infrastructure. But if the GC never runs, you skip not just the tracking cost but the entire freeing cost.

Also, Go uses tcmalloc, which is usually far better than the system default allocator. You can try different allocators in Rust:

https://docs.rs/crate/mimalloc/latest
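
For example, a minimal sketch of opting into mimalloc as the global allocator (this follows the crate's documented setup and requires mimalloc as a dependency in Cargo.toml):

use mimalloc::MiMalloc;

// Route every heap allocation in the program through mimalloc
// instead of the system default allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;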

3 Likes

This is not real code; I made it up so I could post it here. The real code is an XML parser, and the Go version most likely uses the GC. So never calling the GC at all is cheaper than managing memory in code. Does that mean small programs written in GC languages will always outperform Rust?

It depends on how you measure things. I'm pretty sure that in this test you're not measuring runtime initialization cost.

I run each version 100,000 times and only start the clock after the program is already running.

Also, I am not defending Go vs. Rust here. I am new to Rust and I love what it promises. But it has a steeper learning curve, and I just want to make sure this is the right investment.

1 Like

Yeah, language wars are useless unless they're purely for fun. I'm just explaining the differences between the languages and how to address them here.

Rust:

#![allow(dead_code)]
fn main() {
    let v = std::fs::read("story.txt").unwrap();

    let sys_time = std::time::SystemTime::now();
    let mut count = 0;
    loop {
        let _words = get_words(&v);

        count += 1;
        if count == 100_000 {
            break;
        }
    }

    println!("{}: {:?}", count, sys_time.elapsed().unwrap());
}

struct Word {
    text: Vec<u8>,
}

fn get_words(v: &Vec<u8>) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    let mut lh: usize = 0;

    for i in 0..v.len() {
        if v[i].is_ascii_whitespace() {
            if i > lh {
                let s = Vec::from(&v[lh..i]);
                words.push(Word { text: s });
            }
            lh = i + 1;
        }
    }

    words
}

Go:

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	bs, _ := os.ReadFile("story.txt")
	count := 0
	start := time.Now()

	for {
		_ = getWords(bs)
		count++
		if count == 100_000 {
			break
		}
	}

	fmt.Println(count, time.Since(start))
}

type Word struct {
	Text string
}

func getWords(bs []byte) []Word {
	var (
		words []Word
		lh    int
	)

	for i := 0; i < len(bs); i++ {
		b := bs[i]
		if b == ' ' || b == '\r' || b == '\n' {
			if i > lh {
				words = append(words, Word{
					Text: string(bs[lh:i]),
				})
			}
			lh = i + 1
		}
	}

	return words
}

On my non-high-end machine, the Rust version is faster [1]:

$ hyperfine "./a" -N --warmup 3 # go
Benchmark 1: ./a
  Time (mean ± σ):      2.599 s ±  0.175 s    [User: 2.453 s, System: 0.029 s]
  Range (min … max):    2.192 s …  2.858 s    10 runs

$ hyperfine "./target/release/urlo-bench" -N --warmup 3 # rust
Benchmark 1: ./target/release/urlo-bench
  Time (mean ± σ):      2.142 s ±  0.174 s    [User: 1.998 s, System: 0.006 s]
  Range (min … max):    2.062 s …  2.636 s    10 runs

  1. based on the last snippet, but story.txt is just the text copied from the webpage (non-HTML)

2 Likes

Just to note how much difference allocation makes in terms of speed: the example below performs no allocation per Word.
This version is 10x faster than your Rust sample (which implies it is about 3x faster than your Go version).

The main difference is that Word now holds a string slice, and I convert the Vec<u8> to a String once with the same String::from_utf8_lossy, so only one check and one additional allocation are needed (though you could simply read the file in as UTF-8 instead of raw bytes to avoid even that; see the sketch after the code).

I'm not saying that changing to this version is the correct choice, just highlighting the difference it makes, especially when that allocation is in an inner loop.

Note: your version runs in 2.7s on my computer; this version runs in 250ms.

struct Word<'a> {
    text: &'a str,
}

fn get_words(v: &String) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    let mut lh: usize = 0;

    for i in 0..v.len() {
        if v.as_bytes()[i].is_ascii_whitespace() {
            if i > lh {
                words.push(Word { text: &v[lh..i] });
            }
            lh = i + 1;
        }
    }

    words
}

fn main() {
    let v = std::fs::read("story.txt").unwrap();

    let sys_time = std::time::SystemTime::now();
    let v = String::from_utf8_lossy(&v).into_owned();
    let mut count = 0;
    loop {
        let _words = get_words(&v);

        count += 1;
        if count == 100_000 {
            break;
        }
    }

    println!("{}: {:?}", count, sys_time.elapsed().unwrap());
}
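
A minimal sketch of the read-as-UTF-8 variant mentioned above. Note that std::fs::read_to_string returns an error on invalid UTF-8 instead of replacing it the way from_utf8_lossy does:

fn main() {
    // Read and validate UTF-8 in one step; no separate lossy conversion needed.
    let v = std::fs::read_to_string("story.txt").unwrap();
    let words = get_words(&v);
    println!("{}", words.len());
}
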
4 Likes

Cool, almost the same speed here, too:

$ hyperfine  "./a" "./target/release/urlo-bench" -N --warmup 3
Benchmark 1: ./a
  Time (mean ± σ):      2.363 s ±  0.181 s    [User: 2.216 s, System: 0.024 s]
  Range (min … max):    2.232 s …  2.862 s    10 runs

Benchmark 2: ./target/release/urlo-bench
  Time (mean ± σ):     257.6 ms ±   2.7 ms    [User: 252.1 ms, System: 0.0 ms]
  Range (min … max):   254.0 ms … 261.9 ms    11 runs

Summary
  ./target/release/urlo-bench ran
    9.17 ± 0.71 times faster than ./a

1 Like

I want to note that using SystemTime for benchmarking is very definitely wrong: SystemTime is not monotonic, so wall-clock adjustments can skew the measurement. std::time::Instant is the right tool here.
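
For reference, a minimal sketch of the monotonic alternative using std::time::Instant:

use std::time::Instant;

fn main() {
    let start = Instant::now(); // monotonic: immune to wall-clock adjustments

    // ... run the workload here ...

    println!("{:?}", start.elapsed());
}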

2 Likes

I decided to get methodical about optimizing your code, so here I go. First, I made a flamegraph of your original code's performance (a sketch of how to produce one follows the observations below). Here's my flamegraph:

[flamegraph image]

Some observations:

  • Most of the time is spent on allocation (22% malloc, 20% free, and 9% realloc).
  • 17% of the time is spent on String::from_utf8_lossy.
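
For reference, a sketch of producing such a flamegraph, assuming the cargo-flamegraph tool:

$ cargo install flamegraph
$ cargo flamegraph    # profiles the release binary and writes flamegraph.svg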

Using the recommendations from prior commenters, here's an implementation that doesn't check the String for UTF-8 correctness, much like Go:

fn get_words(v: &Vec<u8>) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    let mut lh: usize = 0;

    for i in 0..v.len() {
        if v[i].is_ascii_whitespace() {
            if i > lh {
                let s = unsafe { String::from_utf8_unchecked(v[lh..i].to_vec()) };
                words.push(Word { text: s });
            }
            lh = i + 1;
        }
    }

    words
}

This is already a little faster, clocking in at around 1.03s for me. I'm a new user, so I can't embed more flamegraphs, but it looks about the same with String::from_utf8_lossy removed.

Before I try to eliminate some allocations, I'll take one more step to reach parity with the Go code. Vec contains a pointer to its backing allocation, so using &Vec adds an extra layer of indirection compared to Go's []byte. We can fix this by changing only one line:

fn get_words(v: &[u8]) -> Vec<Word> { /* ... */ }

However, this has almost no impact on our performance, since the allocations were the bottleneck. I suspect that a mix of not needing to free things, plus using a faster allocator, accounts for the remaining difference.

For a final revision, here's the most idiomatic way that I would do it:

fn get_words_idiomatic(v: &str) -> Vec<&str> {
    v.split_ascii_whitespace().collect()
}

This runs in only 230ms for me, a 6.7x improvement!
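
For completeness, a minimal sketch of how this version slots into the same benchmark harness (my adaptation, using Instant as suggested above):

fn get_words_idiomatic(v: &str) -> Vec<&str> {
    v.split_ascii_whitespace().collect()
}

fn main() {
    let bytes = std::fs::read("story.txt").unwrap();
    // Convert once up front so get_words_idiomatic can return borrowed slices.
    let text = String::from_utf8_lossy(&bytes).into_owned();

    let start = std::time::Instant::now();
    for _ in 0..100_000 {
        let _words = get_words_idiomatic(&text);
    }
    println!("{:?}", start.elapsed());
}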

16 Likes