Why is Rust slower than Node at traversing directories?

Hi, here is my test:

Directory files count: 600,000
rust duration: 17s
node fs.js duration: 8s

I think there is something wrong with my code. Please help me and tell me why. Thank you.

rust

use std::fs ;
use std::path;
use std::time::SystemTime;


fn main() {
    let mut _collections:Vec<path::PathBuf> = vec![];
    let sys_time = SystemTime::now();
    _collections = get_files("../../");
    println!("{}", _collections.len());
    println!("=={:?}ms", SystemTime::now().duration_since(sys_time).unwrap().as_millis())
}

fn get_files(path:&str) -> Vec<path::PathBuf>{
    let mut files = vec![];
    let mut dirs  = vec![];
    let _path = path::Path::new(path);
    dirs.push(_path.to_path_buf());
    while !dirs.is_empty() {
        let _path = dirs.pop().unwrap();
        let _files = fs::read_dir(_path).unwrap();
        for file in _files.into_iter() {
            let file_path= file.unwrap().path();
            if file_path.is_dir() && !file_path.is_symlink(){
                dirs.push(file_path);
            }else{
                files.push(file_path);
            }
        }
    }
    files
}

node.js

const fs = require('fs/promises');
const Path = require('path');

async function getFile(path, collector){
  path = Path.resolve(path);
  let files = await fs.readdir(path, {withFileTypes: true});
  for (const file of files){
    if(file.isDirectory() && !file.isSymbolicLink()){
     await getFile(Path.resolve(path, file.name), collector);
    }else{
      collector.push(file.name);
    } 
  }
}
let collector = [];
(async function(){
  console.time('Node==')
  await getFile('../../', collector);
  console.log(collector.length);
  console.timeEnd('Node==');
})();


If I use walkdir, the Rust version is faster than Node.

Did you build with --release?

The single most important Rust performance tip is simple but easy to overlook: make sure you are using a release build rather than a debug build when you want high performance. This is most often done by specifying the --release flag to Cargo.

A release build typically runs much faster than a debug build. 10-100x speedups over debug builds are common!


You can also combine walkdir with e.g. rayon to walk the tree in parallel, which generally speeds up such an operation massively.

Yes, I use the cargo build --release command.

Does this make a difference?

         for file in _files.into_iter() {
             let file_path= file.unwrap().path();
-            if file_path.is_dir() && !file_path.is_symlink(){
+            let meta = std::fs::symlink_metadata(&file_path).unwrap();
+            if meta.is_dir() {
                 dirs.push(file_path);
             }else{
                 files.push(file_path);
             }
         }

(You're calling stat on every entry (resolving symlinks) and then lstat on every directory; the above calls lstat on every entry.)
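
To make the difference concrete, here is a minimal sketch of the two ways of classifying a path. The function names are made up for illustration; only the std calls come from the code above.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper mirroring the original loop: `is_dir()` follows
/// symlinks (a stat-style query), and `is_symlink()` then issues a second,
/// non-following query (lstat-style) whenever the first one returned true.
fn is_real_dir_two_queries(path: &Path) -> bool {
    path.is_dir() && !path.is_symlink()
}

/// Hypothetical helper mirroring the suggested change: one non-following
/// query per entry. A symlink that points at a directory reports itself as
/// a symlink here, so `is_dir()` is false and it is not descended into.
fn is_real_dir_one_query(path: &Path) -> io::Result<bool> {
    Ok(fs::symlink_metadata(path)?.is_dir())
}
```

Both helpers should agree on plain files and plain directories; the second one just gets there with a single metadata lookup and reports errors instead of treating them as "not a directory".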


Did you happen to mean std::fs::symlink_metadata(&file_path).unwrap()?


The while !dirs.is_empty() loop with dirs.pop().unwrap() should be simplified into

    while let Some(_path)=dirs.pop() {

By the way, do you run sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' before each of the two programs executes?
Whichever program runs first should be slower, since it gets no benefit from the cache.
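
For reference, a sketch of what the whole function might look like with that loop. It keeps the original is_dir()/is_symlink() check untouched and only changes the loop shape; returning io::Result instead of unwrapping is my own assumption, not something from the original code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collects every non-directory path below `root` using an explicit stack.
fn get_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    let mut dirs = vec![root.to_path_buf()];
    // `pop()` returns `None` once the stack is empty, which ends the loop,
    // so no separate `is_empty()` check or `unwrap()` is needed.
    while let Some(dir) = dirs.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() && !path.is_symlink() {
                dirs.push(path);
            } else {
                files.push(path);
            }
        }
    }
    Ok(files)
}
```

This is purely a readability change; it should not be expected to affect the timing.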


Yes, thanks.

Are you running it with cargo run --release as well?


Hi, I made the optimization you suggested, but it is still slow.


So isn't the solution to simply use walkdir?

I don't know about the Node version, but as an iterator, WalkDir probably isn't collecting things into Vecs which you need to allocate, populate, and then afterwards iterate over (and probably deallocate). If you made your version recursive, you could remove the dirs Vec, but you'd still be creating then separately iterating the files one.

Hi, my Node version is v16.15.0.

but you'd still be creating then separately iterating the files one.

How can I do that?

Dropping the caches matters on my machine. I just copied and ran the code given by the OP.

$ cargo r -q -r # first time
281483
==13159ms
$ cargo r -q -r # second time
281483
==3959ms
$ cargo r -q -r
281483
==1443ms
$ cargo r -q -r
281483
==1109ms
$ cargo r -q -r
281483
==1137ms
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ cargo r -q -r # first time
281483
==18544ms
$ cargo r -q -r # second time
281483
==1927ms

$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ cargo r -q -r
281483
==18747ms

$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ cargo r -q -r
281483
==18557ms
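
If you want to see the warm-up effect from inside a single process instead of re-running the binary, something like the following sketch could work. time_runs is a name I made up, and it uses Instant rather than SystemTime because Instant is monotonic.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `runs` times in a row and records the result count and the
/// elapsed time of each run, so the first (possibly cold-cache) run can be
/// compared against the later (warm-cache) ones.
fn time_runs<F>(runs: usize, mut work: F) -> Vec<(usize, Duration)>
where
    F: FnMut() -> usize,
{
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            let count = work();
            (count, start.elapsed())
        })
        .collect()
}
```

With the OP's function this could be called as time_runs(5, || get_files("../../").len()). Only the first run after dropping the caches is a cold measurement; the rest mostly measure the OS cache.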

You could read the source code of walkdir... but sure, let's bang out a quick and dirty iterator and see how it compares for you.

use std::fs::{self, ReadDir};
use std::path::{Path, PathBuf};
use std::time::SystemTime;

struct GetFiles {
    stack: Vec<PathBuf>,
    current: ReadDir,
}

impl Iterator for GetFiles {
    type Item = PathBuf;
    fn next(&mut self) -> Option<Self::Item> {                                   
        loop {
            if let Some(file) = self.current.next().and_then(|f| f.ok()) {
                let file = file.path();
                if fs::symlink_metadata(&file).map(|m| m.is_dir()).unwrap_or(false) {
                    self.stack.push(file);
                } else {
                    return Some(file);
                }   
            } else {
                let dir = self.stack.pop()?;
                self.current = fs::read_dir(dir).ok()?;
            }
        }
    }
}

impl GetFiles {
    fn new(path: &Path) -> Self {
        let stack = <_>::default();
        let current = fs::read_dir(path).unwrap().into_iter();
        Self { stack, current, }
    }
}

fn main() {
    let sys_time = SystemTime::now();
    let count = GetFiles::new("../../".as_ref()).count();
    println!("{count}");
    println!("=={:?}ms", SystemTime::now().duration_since(sys_time).unwrap().as_millis());
}

Though it didn't make a big difference for me. The caches @Neutron3529 mentioned make a big difference though, so be sure you're timing what you think you are.


Thank you. I ran sync && sudo purge on macOS, which is the equivalent of the Linux command.

from: linux - echo 3 > /proc/sys/vm/drop_caches on Mac OSX - Stack Overflow

Thank you. I tried running your code and got the same performance, 16s. I'm reading the source code of walkdir, and I feel it's too complicated to write high-performance file system code in Rust.

The following change (querying file type via DirEntry) resulted in significant speedup for me; at a guess the DirEntry has the metadata already and a syscall is avoided.

impl Iterator for GetFiles {
    type Item = PathBuf; // DirEntry;
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if let Some(entry) = self.current.next().and_then(|f| f.ok()) {
                if entry.file_type().ok()?.is_dir() {
                    self.stack.push(entry.path());
                } else {
                    return Some(entry.path());
                }
            } else {
                let dir = self.stack.pop()?;
                self.current = fs::read_dir(dir).ok()?;
            }
        }
    }
}

I don't know how you were comparing with walkdir, but it returns DirEntrys and thus doesn't even create a PathBuf for everything (which inspired the above change -- I went back to returning PathBuf because it didn't make a big difference for me).

(I'm on Linux, not Mac.)
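
For completeness, a rough sketch of the DirEntry-returning variant, using only std. WalkEntries is a made-up name, and this is not how walkdir itself is implemented. Unlike the quick version above, it skips a directory that fails to open instead of ending the whole iteration. The std docs say DirEntry::file_type does not follow symlinks and needs no extra system call on Windows and most Unix platforms, which would fit the guess above.

```rust
use std::fs::{self, DirEntry, ReadDir};
use std::path::{Path, PathBuf};

/// Hypothetical iterator that yields every non-directory entry below a root.
struct WalkEntries {
    stack: Vec<PathBuf>,
    current: Option<ReadDir>,
}

impl WalkEntries {
    fn new(root: &Path) -> Self {
        Self { stack: vec![root.to_path_buf()], current: None }
    }
}

impl Iterator for WalkEntries {
    type Item = DirEntry;

    fn next(&mut self) -> Option<DirEntry> {
        loop {
            // Take the next entry of the directory being read, if any.
            let next = self.current.as_mut().and_then(|rd| rd.next());
            match next {
                Some(Ok(entry)) => match entry.file_type() {
                    // Real directories are queued; symlinks are not followed.
                    Ok(ft) if ft.is_dir() => self.stack.push(entry.path()),
                    // Everything else (including entries whose type could
                    // not be determined) is yielded to the caller.
                    _ => return Some(entry),
                },
                // A read error abandons the rest of this directory only.
                Some(Err(_)) => self.current = None,
                None => {
                    // Directory exhausted: move on to the next queued one,
                    // or finish when the stack is empty.
                    let dir = self.stack.pop()?;
                    // A directory that cannot be opened is skipped.
                    self.current = fs::read_dir(dir).ok();
                }
            }
        }
    }
}
```

Counting is then WalkEntries::new("../../".as_ref()).count(), and a caller that needs paths can map each entry with entry.path() only when it actually needs one.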


Amazing. I ran your code and the time dropped to 7s, the same as walkdir.

at a guess the DirEntry has the metadata already and a syscall is avoided