How to use async reqwest with multithreading

Hello,
I'm new to Rust and I created a tool (GitHub - Neolex-Security/WaybackRust: WaybackRust is a tool written in Rust to query the WaybackMachine.) that makes a lot of requests.
I'm trying to switch from the old version of reqwest used on GitHub to the new one with async.
So I removed the threadpool, made the functions async, and added .await after the reqwest::get calls.

The issue is that my code is now much slower than before, I guess because it's no longer multi-threaded.

So I have 2 questions:

  • How should I make my new async version multi-threaded? Should I still use the threadpool crate, or something else?
  • Second, why is the async version slower even when I make only one request without a threadpool? Am I doing something wrong?
    Why is the first snippet here faster than the second one?
// without the async version of reqwest: (faster, why?)
reqwest::get(url.as_str())
            .expect("Error GET request")
            .text()
            .expect("Error parsing response")
            .lines()
            .map(|item| item.to_string())
            .filter(|file| whitelist.iter().any(|ext| file.ends_with(ext)))
            .collect()

// with the async version of reqwest: (slower)
reqwest::get(url.as_str()).await
            .expect("Error GET request")
            .text().await
            .expect("Error parsing response")
            .lines()
            .map(|item| item.to_string())
            .filter(|file| whitelist.iter().any(|ext| file.ends_with(ext)))
            .collect()

Thank you

Sorry for my English, but I will try to explain it.

For the first question, you can use rayon to get multithreading and fetch the URLs in a multi-threaded, async way.

Second, the async interaction with the operating system is faster than the sync version, but in the code you posted the file operations are not async, and that is a problem.

Can you post a benchmark so we can see the differences? Are you compiling with --release?

Is each individual request slower, or are you seeing less parallelism? As for how to do multithreading, you can spawn each request on a separate Tokio task with tokio::spawn. You should not try to make async code multithreaded with rayon. Rayon is for non-async heavy computations, not IO.


I think each individual request is slower, because when I call "run_url()", which makes only one request, it is significantly slower on the new version (even with --release).

Normally I would expect async reqwest to perform about the same as sync. Can you give more details? Perhaps a minimal example with the issue or link to code?

Can you benchmark these 6 cases:

1 thread sync
1 thread async
2 threads sync
2 threads async
4 threads sync
4 threads async

if you can't show all of the code?
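For such a comparison, a tiny std-only timing harness like the following sketch could wrap each variant. The workload closure here is a placeholder so the sketch runs without network access; the actual sync or async request would go in its place:

```rust
use std::time::Instant;

// Run a closure once, print how long it took, and return the elapsed
// milliseconds; the closure is where the request under test would go.
fn time_it<F: FnOnce()>(label: &str, f: F) -> u128 {
    let start = Instant::now();
    f();
    let ms = start.elapsed().as_millis();
    println!("{}: {} ms", label, ms);
    ms
}

fn main() {
    // Placeholder workload standing in for one request.
    time_it("placeholder", || {
        std::thread::sleep(std::time::Duration::from_millis(50));
    });
}
```

Running each case several times and comparing the printed numbers helps separate server response-time variance from real differences between the versions.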

Here is the result of time waybackrust urls google.com -n on the old version and the new one:

  • old
real	0m7.682s
user	0m0.199s
sys	0m0.086s

  • new
real    0m16.222s
user    0m0.662s
sys     0m0.390s

The code is on GitHub: WaybackRust/main.rs at master · Neolex-Security/WaybackRust · GitHub

Two questions:

I suppose that is the "older" version, correct?
Is your operating system Windows?

It doesn't really say anything if I can't see the code.

Yes, it's the older version, and my OS is Linux (Debian 10).

I tried to make a minimal example with this request, the map, etc., but as you said the time is approximately the same for both, so I'm kind of lost as to why my project is slower with only this change.

Can you post the new version where you try different thread counts with the async approach?

Sorry,
here are the two versions of the code; the "new" one is way slower:

//new 
extern crate reqwest;
#[tokio::main]
async fn main() {
    let blacklist: Vec<String> = vec!["js".to_string(), "png".to_string()];
    let _urls: Vec<String> = reqwest::get("http://web.archive.org/cdx/search/cdx?url=google.com/*&output=text&fl=original&collapse=urlkey").await
            .expect("Error GET request")
            .text().await
            .expect("Error parsing response")
            .lines()
            .map(|item| item.to_string())
            .filter(|file| !blacklist.iter().any(|ext| file.ends_with(ext)))
            .collect(); 

    println!("{}", _urls.join("\n"));
}

// new 
real	0m20.230s
user	0m0.526s
sys	0m0.542s

// old
extern crate reqwest;

fn main() {
    let blacklist: Vec<String> = vec!["js".to_string(), "png".to_string()];
    let _urls: Vec<String> = reqwest::get("http://web.archive.org/cdx/search/cdx?url=google.com/*&output=text&fl=original&collapse=urlkey")
            .expect("Error GET request")
            .text()
            .expect("Error parsing response")
            .lines()
            .map(|item| item.to_string())
            .filter(|file| !blacklist.iter().any(|ext| file.ends_with(ext)))
            .collect();

    println!("{}", _urls.join("\n"));
}

// old 
real	0m6.547s
user	0m0.169s
sys	0m0.118s

OK, these two programs are each using only 1 thread.

Can you try at least ten times, to see whether the difference is in the speed of the response from web.archive.org or in the internals of reqwest?

And I repeat that in the async version your file handling is done the sync way (it's not a problem, but a missed opportunity to optimize).

I can reproduce the issue locally. The issue appears to be specific to the wayback machine url.

What are you referring to here?

I'm programming a scraper with reqwest, and I think that after getting the response, when downloading big files from the web, you can do something like:

                    // imprimir_amarillo prints a warning in yellow;
                    // borrar_directorio_y_panic deletes the directory and panics.
                    match tokio::fs::File::create(new_file).await {
                        Ok(mut file) => {
                            // Stream the response body chunk by chunk instead of
                            // buffering it all in memory.
                            while let Some(chunk) = match doc1.chunk().await {
                                Ok(x) => x,
                                Err(e) => {
                                    imprimir_amarillo(
                                        "Failed to get image data, deleting directory",
                                    );
                                    println!("Error: {}", e);
                                    borrar_directorio_y_panic(&directorio_capitulo);
                                }
                            } {
                                match file.write_all(&chunk).await {
                                    Ok(_) => (),
                                    Err(e) => {
                                        imprimir_amarillo(
                                            "Failed to write to the file, deleting directory",
                                        );
                                        println!("Error: {}", e);
                                        borrar_directorio_y_panic(&directorio_capitulo);
                                    }
                                }
                            }
                        }
                        Err(e) => {
                            imprimir_amarillo("Failed to create file, deleting directory");
                            println!("Error: {}", e);
                            borrar_directorio_y_panic(&directorio_capitulo);
                        }
                    }

and the line of code

        .filter(|file| !blacklist.iter().any(|ext| file.ends_with(ext)))

made me think that the results were going to a file... my mistake.

So you have no idea why the issue happens?
If the request sent is the same, the wayback machine should respond in the same time, right?

I would post a bug on GitHub with the nice minimal example you posted.

On which GitHub? Sorry.

https://github.com/seanmonstar/reqwest