[Solved] How to quickly convert [u8] to Vec<char>?

How can I quickly convert [u8] to Vec<char>?

I use the function below, but it is very slow. The Python version takes 25 seconds; the Rust version takes 43 seconds.

use std::{collections::HashMap, fs::File, io::Read};

fn word_counter(path: String, words: &mut HashMap<char, i32>) {
    let mut f = File::open(path).expect("cannot open file");
    let mut buf = Vec::new();
    f.read_to_end(&mut buf).expect("failed to read file");

    // Invalid UTF-8 sequences are replaced with U+FFFD.
    let contents = String::from_utf8_lossy(&buf);
    for c in contents.chars() {
        let stat = words.entry(c).or_insert(0);
        *stat += 1;
    }
}

You can read a file to a string using

f.read_to_string(&mut string).unwrap();

fs::read_to_string can be even faster.
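
As a rough sketch of the fs::read_to_string route applied to the character count from the question: the helper name count_chars_in_file is made up for illustration, and it assumes the file is valid UTF-8 (otherwise the read returns an error instead of a string).

```rust
use std::{collections::HashMap, fs, io, path::Path};

// Hypothetical helper: read the whole file as UTF-8 text and count each char.
// `fs::read_to_string` fails with `ErrorKind::InvalidData` on invalid UTF-8.
fn count_chars_in_file(path: &Path) -> io::Result<HashMap<char, i32>> {
    let text = fs::read_to_string(path)?;
    let mut counts = HashMap::new();
    for c in text.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    Ok(counts)
}
```

Returning io::Result lets the caller decide what to do when the file is missing or is not UTF-8, rather than panicking inside the helper.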

Are you running the Rust program in debug mode or release mode?

Did you compile in release mode? By default using cargo run will execute a program without any optimisations (because it builds faster and is easier to debug), but that can often be 10x slower than when you compile with optimisations.
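
If you are not sure which kind of build you are timing, one possible check is to have the program report it. This is only a sketch: build_kind is a made-up name, and it relies on debug_assertions being on in the default dev profile and off in the default release profile, so it is a proxy for optimisation level and not a direct measurement.

```rust
// Reports which kind of build is running. `debug_assertions` is enabled by
// default in the dev profile and disabled in the release profile, so a custom
// profile could make this answer misleading.
fn build_kind() -> &'static str {
    if cfg!(debug_assertions) {
        "debug (unoptimised by default)"
    } else {
        "release"
    }
}
```

Printing build_kind() at the start of a benchmark run makes it harder to compare a debug binary against another language by accident.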

I tried this on my computer and the naive Rust program was about 5x faster than Python.

The Python program:

import sys
from collections import defaultdict

filename = sys.argv[1]

words = defaultdict(int)

with open(filename) as f:
    for line in f:
        for word in line.split():
            words[word] += 1

print(len(words))

And translated to Rust:

use std::{env, fs::File, io::Read, collections::HashMap};

fn main() {
    let filename = env::args().nth(1).expect("Usage: temp <filename>");

    let mut words = HashMap::new();

    let mut f = File::open(filename).unwrap();
    let mut buffer = Vec::new();
    f.read_to_end(&mut buffer).unwrap();

    let content = String::from_utf8(buffer).unwrap();

    for word in content.split_whitespace() {
        let occurrences = words.entry(word).or_insert(0);
        *occurrences += 1;
    }

    println!("Found {} words", words.len());
}

Running it against a random word list I had lying around:

$ wc --words /usr/share/dict/british-english
  101921 /usr/share/dict/british-english
$ time python main.py /usr/share/dict/british-english     
  101921
  python main.py /usr/share/dict/british-english  0.06s user 0.00s system 99% cpu 0.068 total
$ cargo build
$ time ./target/debug/temp /usr/share/dict/british-english  
  Found 101921 words
  ./target/debug/temp /usr/share/dict/british-english  0.25s user 0.00s system 99% cpu 0.250 total
$ cargo build --release
$ time ./target/release/temp /usr/share/dict/british-english
  Found 101921 words
  ./target/release/temp /usr/share/dict/british-english  0.01s user 0.01s system 97% cpu 0.020 total

I started out using read_to_string, but my files have a lot of invalid UTF-8 bytes, so I have to use String::from_utf8_lossy.
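
To illustrate what the lossy conversion does with bad bytes, here is a small sketch that works on a byte slice instead of a file. The function name count_chars_lossy is hypothetical. Each invalid sequence is counted under the replacement character U+FFFD, so the original byte values are not recoverable from the result.

```rust
use std::collections::HashMap;

// Hypothetical helper: count chars in a byte slice that may contain invalid UTF-8.
// `from_utf8_lossy` returns a Cow: it borrows when the input is already valid
// and only allocates a new String when it has to insert U+FFFD.
fn count_chars_lossy(bytes: &[u8]) -> HashMap<char, i32> {
    let text = String::from_utf8_lossy(bytes);
    let mut counts = HashMap::new();
    for c in text.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    counts
}
```

For input such as b"ab\xffa", the stray 0xFF byte is counted as one U+FFFD alongside the ordinary letters.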

Solved! I built in release mode and it only took 5 seconds!
Thank you very much!

I tried it a few times and most runs finished in 1-2 seconds. Unbelievable!
All I did was change

$ cargo run

to

$ cargo run --release

You are right, release mode is a lot faster.
Thanks

Maybe it's not UTF-8 then?
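
If that turns out to be the case, one option is to try strict UTF-8 first and fall back to a single-byte encoding. The sketch below assumes the fallback is Latin-1 (ISO-8859-1), which is a guess and not something stated in this thread. The name decode_utf8_or_latin1 is made up for illustration.

```rust
// Decode bytes as UTF-8 if valid; otherwise treat each byte as one Latin-1
// (ISO-8859-1) character. The choice of fallback encoding is an assumption.
fn decode_utf8_or_latin1(bytes: &[u8]) -> Vec<char> {
    match std::str::from_utf8(bytes) {
        Ok(text) => text.chars().collect(),
        // `u8 as char` maps byte values 0x00..=0xFF to U+0000..=U+00FF,
        // which matches the Latin-1 code points.
        Err(_) => bytes.iter().map(|&b| b as char).collect(),
    }
}
```

This also gives a direct [u8] to Vec<char> conversion for the single-byte case, although iterating over chars without collecting them is usually enough for counting.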
