Code review request: why is this Rust code 3 times slower than Go?

What is fundamentally wrong with this code if it runs 3 times slower than similar Go code? What am I doing wrong? Yes, I ran it in release mode. Yes, it could be one line of code using collection traits. I just want to know where the bottleneck is.

struct Word {
    text: String,
}

fn get_words(v: &Vec<u8>) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    let mut lh: usize = 0;

    for i in 0..v.len() {
        if v[i].is_ascii_whitespace() {
            if i > lh {
                let s = String::from_utf8_lossy(&v[lh..i]).to_string();
                words.push(Word { text: s });
            }
            lh = i + 1;
        }
    }

    words
}

Here is its Go counterpart:

type Word struct {
	Text string
}

func getWords(bs []byte) []Word {
	var (
		words []Word
		lh    int
	)

	for i := 0; i < len(bs); i++ {
		b := bs[i]
		if b == ' ' || b == '\r' || b == '\n' {
			if i > lh {
				words = append(words, Word{
					Text: string(bs[lh:i]),
				})
			}
			lh = i + 1
		}
	}

	return words
}

2 Likes

I haven’t tried it out, but a Go string is just bytes, right? That would make the equivalent Rust code use

struct Word {
    text: Vec<u8>,
}

instead, with something like Vec::from(&v[lh..i]) in place of from_utf8_lossy. The latter has extra overhead because it checks for valid UTF-8.

Strings everywhere are just bytes. And I do the safe (meaning, not the fastest) conversion from bytes to string in my Go version. But even after updating the Rust code to make the text field a Vec<u8> and using Vec::from(&v[lh..i]), the result is the same -- Go is 3 times faster.

Here you're constructing a String from a byte slice using String::from_utf8_lossy, duplicating its contents with .to_string(), and immediately dropping the original by not using it. Fixing that may give a huge boost, but I wouldn't expect it to beat the Go version, since:

  1. Go doesn't care whether its string type contains invalid UTF-8 sequences, so it doesn't check. String::from_utf8_lossy checks and performs a lossy conversion.

  2. For small programs of this size it's common for the garbage collector not to run even once. By not running, it skips not only the GC tracking cost but also the heap deallocation cost entirely. Rust doesn't have a GC, and it frees each allocation as soon as it is no longer used.

7 Likes

from_utf8_lossy() returns a Cow, so it's not quite that bad, but it should probably be into_owned() anyway.
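
A minimal sketch of that change, as a drop-in for the push in the loop above:

// from_utf8_lossy returns Cow<str>: Borrowed when the bytes are valid UTF-8,
// Owned when replacement characters had to be inserted. into_owned() moves the
// Owned case instead of copying it again the way .to_string() does.
let s = String::from_utf8_lossy(&v[lh..i]).into_owned();
words.push(Word { text: s });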

I changed my Rust to this:

struct Word {
    text: Vec<u8>,
}

fn get_words(v: &Vec<u8>) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    let mut lh: usize = 0;

    for i in 0..v.len() {
        if v[i].is_ascii_whitespace() {
            if i > lh {
                let s = Vec::from(&v[lh..i]);
                words.push(Word { text: s });
            }
            lh = i + 1;
        }
    }

    words
}

So, it's pure bytes now, no checking for invalid UTF-8.
I wouldn't post this topic if the results were close: a few percent give or take. But this is a threefold difference. Not calling the GC vs. not having a GC should at least come out equal.

Idiomatically, I recommend changing the argument to v: &[u8] and the loop to for (i, b) in v.iter().enumerate() {.

Both changes are potential performance improvements, although I don't expect them to make up the 3x difference. Can you share the calling code or the data passed in, so the performance difference is reproducible?
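
A minimal sketch combining both suggestions (assuming the Vec<u8>-based Word from your updated version):

struct Word {
    text: Vec<u8>,
}

fn get_words(v: &[u8]) -> Vec<Word> {
    let mut words = Vec::new();
    let mut lh = 0;

    // enumerate() yields (index, &byte) pairs, avoiding the bounds check
    // that v[i] performs on every iteration.
    for (i, b) in v.iter().enumerate() {
        if b.is_ascii_whitespace() {
            if i > lh {
                words.push(Word { text: v[lh..i].to_vec() });
            }
            lh = i + 1;
        }
    }

    words
}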

2 Likes

fn main() {
    let v = std::fs::read("story.txt").unwrap();

    let sys_time = std::time::SystemTime::now();
    let mut count = 0;
    loop {
        let _words = get_words(&v);

        count += 1;
        if count == 100_000 {
            break;
        }
    }

    println!("{}: {:?}", count, sys_time.elapsed().unwrap());
}

vs

func main() {
	bs, _ := os.ReadFile("story.txt")
	count := 0
	start := time.Now()

	for {
		_ = getWords(bs)
		count++
		if count == 100_000 {
			break
		}
	}

	fmt.Println(count, time.Since(start))
}

And I used this for my text file: https://doc.rust-lang.org/book/foreword.html )))

As I mentioned, it is not. To use C terms for familiarity: by not having a GC, you pay the cost of malloc/free. By having a GC, you normally pay for malloc/free plus the additional cost of the GC tracking infrastructure. But if the GC never runs, you skip not just the tracking cost but the entire freeing cost.

Also, Go uses tcmalloc, which is usually far better than the system default allocator. You can try different allocators in Rust:

https://docs.rs/crate/mimalloc/latest
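
For example, a minimal sketch of opting into mimalloc as the global allocator (this follows the crate's documented setup and requires mimalloc as a dependency in Cargo.toml):

use mimalloc::MiMalloc;

// Route every heap allocation in the program through mimalloc
// instead of the system default allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;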

3 Likes

This is not real code; I made it up so I could post it here. The real code is an XML parser, and the Go version most likely uses the GC. So never calling the GC at all is cheaper than managing memory in code. Does that mean small programs written in GC languages will always outperform Rust?

It depends on how you measure things. I'm pretty sure that in this test you're not measuring runtime initialization cost.

I run each version 100,000 times and only start the clock after the program is already running.

Also, I am not defending Go vs. Rust here. I am new to Rust and I love what it promises. But it has a steeper learning curve, and I just want to make sure this is the right investment.

1 Like

Yeah, language wars are useless unless they're purely for fun. I'm just explaining the differences between the languages and how to address them here.

Rust:

#![allow(dead_code)]
fn main() {
    let v = std::fs::read("story.txt").unwrap();

    let sys_time = std::time::SystemTime::now();
    let mut count = 0;
    loop {
        let _words = get_words(&v);

        count += 1;
        if count == 100_000 {
            break;
        }
    }

    println!("{}: {:?}", count, sys_time.elapsed().unwrap());
}

struct Word {
    text: Vec<u8>,
}

fn get_words(v: &Vec<u8>) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    let mut lh: usize = 0;

    for i in 0..v.len() {
        if v[i].is_ascii_whitespace() {
            if i > lh {
                let s = Vec::from(&v[lh..i]);
                words.push(Word { text: s });
            }
            lh = i + 1;
        }
    }

    words
}

Go:

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	bs, _ := os.ReadFile("story.txt")
	count := 0
	start := time.Now()

	for {
		_ = getWords(bs)
		count++
		if count == 100_000 {
			break
		}
	}

	fmt.Println(count, time.Since(start))
}

type Word struct {
	Text string
}

func getWords(bs []byte) []Word {
	var (
		words []Word
		lh    int
	)

	for i := 0; i < len(bs); i++ {
		b := bs[i]
		if b == ' ' || b == '\r' || b == '\n' {
			if i > lh {
				words = append(words, Word{
					Text: string(bs[lh:i]),
				})
			}
			lh = i + 1
		}
	}

	return words
}

On my non-high-end machine, the Rust version is faster [1]:

$ hyperfine "./a" -N --warmup 3 # go
Benchmark 1: ./a
  Time (mean ± σ):      2.599 s ±  0.175 s    [User: 2.453 s, System: 0.029 s]
  Range (min … max):    2.192 s …  2.858 s    10 runs

$ hyperfine "./target/release/urlo-bench" -N --warmup 3 # rust
Benchmark 1: ./target/release/urlo-bench
  Time (mean ± σ):      2.142 s ±  0.174 s    [User: 1.998 s, System: 0.006 s]
  Range (min … max):    2.062 s …  2.636 s    10 runs

  1. based on the last snippet, but story.txt is just the text copied from the webpage (non-HTML)

2 Likes

Just to note how much difference allocation makes in terms of speed: the example below performs no allocation per Word.
This version is 10x faster than your Rust sample (which implies it is about 3x faster than your Go version).

The main difference is that Word now holds a string slice, and I convert the Vec<u8> to a String once with the same String::from_utf8_lossy, so only one check and one additional allocation are needed (though you could simply read the file in as UTF-8 instead of raw bytes to avoid even that; see the sketch after the code).

I'm not saying that changing to this version is the correct choice, just highlighting the difference it makes, especially when that allocation is in an inner loop.

Note: your version runs in 2.7s on my computer; this version runs in 250ms.

struct Word<'a> {
    text: &'a str,
}

fn get_words(v: &String) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    let mut lh: usize = 0;

    for i in 0..v.len() {
        if v.as_bytes()[i].is_ascii_whitespace() {
            if i > lh {
                words.push(Word { text: &v[lh..i] });
            }
            lh = i + 1;
        }
    }

    words
}

fn main() {
    let v = std::fs::read("story.txt").unwrap();

    let sys_time = std::time::SystemTime::now();
    let v = String::from_utf8_lossy(&v).into_owned();
    let mut count = 0;
    loop {
        let _words = get_words(&v);

        count += 1;
        if count == 100_000 {
            break;
        }
    }

    println!("{}: {:?}", count, sys_time.elapsed().unwrap());
}
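
A minimal sketch of the read-as-UTF-8 variant mentioned above. Note that std::fs::read_to_string returns an error on invalid UTF-8 instead of replacing it the way from_utf8_lossy does:

fn main() {
    // Read and validate UTF-8 in one step; no separate lossy conversion needed.
    let v = std::fs::read_to_string("story.txt").unwrap();
    let words = get_words(&v);
    println!("{}", words.len());
}
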
4 Likes

Cool, almost the same speed here, too:

$ hyperfine  "./a" "./target/release/urlo-bench" -N --warmup 3
Benchmark 1: ./a
  Time (mean ± σ):      2.363 s ±  0.181 s    [User: 2.216 s, System: 0.024 s]
  Range (min … max):    2.232 s …  2.862 s    10 runs

Benchmark 2: ./target/release/urlo-bench
  Time (mean ± σ):     257.6 ms ±   2.7 ms    [User: 252.1 ms, System: 0.0 ms]
  Range (min … max):   254.0 ms … 261.9 ms    11 runs

Summary
  ./target/release/urlo-bench ran
    9.17 ± 0.71 times faster than ./a

1 Like

I want to note that using SystemTime for benchmarking is very definitely wrong: SystemTime is not monotonic, so wall-clock adjustments can skew the measurement. std::time::Instant is the right tool here.
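
For reference, a minimal sketch of the monotonic alternative using std::time::Instant:

use std::time::Instant;

fn main() {
    let start = Instant::now(); // monotonic: immune to wall-clock adjustments

    // ... run the workload here ...

    println!("{:?}", start.elapsed());
}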

2 Likes

I decided to get methodical about optimizing your code, so here I go. First, I made a flamegraph of your original code's performance (a sketch of how to produce one follows the observations below). Here's my flamegraph:

[flamegraph image]

Some observations:

  • Most of the time is spent on allocation (22% malloc, 20% free, and 9% realloc).
  • 17% of the time is spent on String::from_utf8_lossy.
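
For reference, a sketch of producing such a flamegraph, assuming the cargo-flamegraph tool:

$ cargo install flamegraph
$ cargo flamegraph    # profiles the release binary and writes flamegraph.svg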

Using the recommendations from prior commenters, here's an implementation that doesn't check the String for UTF-8 correctness, much like Go:

fn get_words(v: &Vec<u8>) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    let mut lh: usize = 0;

    for i in 0..v.len() {
        if v[i].is_ascii_whitespace() {
            if i > lh {
                let s = unsafe { String::from_utf8_unchecked(v[lh..i].to_vec()) };
                words.push(Word { text: s });
            }
            lh = i + 1;
        }
    }

    words
}

This is already a little faster, clocking in at around 1.03s for me. I'm a new user, so I can't embed more flamegraphs, but it looks about the same with String::from_utf8_lossy removed.

Before I try to eliminate some allocations, I'll take one more step to reach parity with the Go code. Vec contains a pointer to its backing allocation, so using &Vec adds an extra layer of indirection compared to Go's []byte. We can fix this by changing only one line:

fn get_words(v: &[u8]) -> Vec<Word> { /* ... */ }

However, this has almost no impact on our performance, since the allocations were the bottleneck. I suspect that a mix of not needing to free things, plus using a faster allocator, accounts for the remaining difference.

For a final revision, here's the most idiomatic way that I would do it:

fn get_words_idiomatic(v: &str) -> Vec<&str> {
    v.split_ascii_whitespace().collect()
}

This runs in only 230ms for me, a 6.7x improvement!
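
For completeness, a minimal sketch of how this version slots into the same benchmark harness (my adaptation, using Instant as suggested above):

fn get_words_idiomatic(v: &str) -> Vec<&str> {
    v.split_ascii_whitespace().collect()
}

fn main() {
    let bytes = std::fs::read("story.txt").unwrap();
    // Convert once up front so get_words_idiomatic can return borrowed slices.
    let text = String::from_utf8_lossy(&bytes).into_owned();

    let start = std::time::Instant::now();
    for _ in 0..100_000 {
        let _words = get_words_idiomatic(&text);
    }
    println!("{:?}", start.elapsed());
}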

16 Likes