Dumb question about async

This is almost certainly just me missing something, but my async code doesn't seem to actually run asynchronously. I'm using Tokio for a project that talks to the web from 10 concurrent tasks (not really threads), each running a function I wrote.

This is my main function:

use tokio::{self, task::JoinHandle};
use futures::future::join_all;
use lookup::*;

#[tokio::main]
async fn main() {
    let mut handles = Vec::<JoinHandle<()>>::new();
    for i in 0..NUM_BATCHES {
        handles.push(tokio::spawn(async move {
            generate_training_data(BATCH_SIZE, format!("./data/data_{}.csv", i)).await.unwrap();
        }));
    }
    
    join_all(handles).await;
}

I have it set up so that each task shows a progress bar. When I run the program, only the last progress bar moves. When it finishes, the other bars never run, and only the first batch file is saved.

The generate_training_data function:

use scraper::{Html, Selector};
use reqwest;
use regex::Regex;
use indicatif::ProgressBar;
use serde::{Deserialize, Serialize};
use csv;

pub async fn generate_training_data(num_samples: usize, output_path: String) -> std::io::Result<()> {
    let mut writer = csv::Writer::from_path(output_path).unwrap();
    let bar = ProgressBar::new(num_samples as u64);
    let example_selector = Selector::parse("div.samples-list > article > div.samples-list__item__content").unwrap();
    let title_selector = Selector::parse("title").unwrap();
    let re = Regex::new(r#" \(.*\)"#).unwrap();

    for _ in 0..num_samples {
        match reqwest::get(RANDOM_URL).await {
            Ok(page) => {
                match page.text().await {
                    Ok(doc) => {
                        let html = Html::parse_document(&doc);

                        let full_title = html
                            .select(&title_selector)
                            .last()
                            .unwrap()
                            .text()
                            .collect::<String>();

                        let split: Vec<&str> = full_title.split(" - ").collect();
                        let acronym = split[0].to_string();
                        let full_def = split[1].replace(" | AcronymFinder", "");
                        let def = re.replace(&full_def, "").to_string();

                        for item in html.select(&example_selector) {
                            let text = re.replace(&item.text().collect::<String>(), "").to_string();
                            let data = Data { text, abbr: acronym.clone(), definition: def.clone() };
                            
                            match writer.serialize(data) {
                                Ok(_) => { }
                                Err(e) => {
                                    println!("{e}");
                                }
                            };
                        }
                    }

                    Err(e) => { println!("WARNING: {e}"); }
                }
            }

            Err(e) => { println!("WARNING: {e}"); }
        }
        bar.inc(1);
    }
    bar.finish_with_message("Done!");

    writer.flush().unwrap();
    Ok(())
}

Async is my enemy, but definitely necessary for a web scraper.

A little more debugging I did: I changed join_all to try_join_all and matched on the result, and it returned an Ok value. Not really sure what this means.

Ok, edit: Turns out I just needed a let binding. Dumb question, like I said. Now it's giving me a different error, though. Not what I originally posted about, I know, but now I get "error trying to connect: dns error: task _ was cancelled". Love async!!

I think you should take a look at your progress bars; your code does not look like the crate's MultiProgress example. (P.S. I have not used it, so I am only going by the example.)

If you are also seeing network errors, the website might not appreciate you sending lots of requests rapidly. Standard practice is to limit the overall number of requests you send per second. Your program only caps the number of concurrent requests at 10, which is not a rate limit: if each request finishes quickly, 10 tasks can still send far more than 10 requests per second.
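A rough sketch of the difference between a concurrency cap and a rate limit, using only the standard library. Pacer is a hypothetical type invented for this example, not something from Tokio or reqwest. It hands out start times spaced a fixed gap apart and tells each caller how long to wait before sending.

```rust
use std::time::{Duration, Instant};

/// Hands out request start times spaced at least `min_gap` apart.
/// (Hypothetical helper for illustration.)
struct Pacer {
    min_gap: Duration,
    next_slot: Instant,
}

impl Pacer {
    /// `requests_per_second` must be non-zero, otherwise the division panics.
    fn new(requests_per_second: u32, now: Instant) -> Self {
        Pacer {
            min_gap: Duration::from_secs(1) / requests_per_second,
            next_slot: now,
        }
    }

    /// Reserves the next free slot and returns how long the caller
    /// should wait before sending its request.
    fn reserve(&mut self, now: Instant) -> Duration {
        // If the limiter has been idle, the slot is "now"; otherwise it is
        // the next slot that has not been handed out yet.
        let start = self.next_slot.max(now);
        self.next_slot = start + self.min_gap;
        start.duration_since(now)
    }
}
```

With Tokio, one way to use this would be to share a single Pacer between all tasks behind a Mutex, call reserve(Instant::now()), release the lock, and then tokio::time::sleep for the returned duration before each reqwest::get. A shared tokio::time::interval, or a dedicated rate-limiting crate, would be reasonable alternatives.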

The error I posted about after the edit was code-related, but I've fixed it now by rewriting some of the code (to be better lol). That was when I was just testing with a few requests; now that I'm actually using it for proper data scraping, I've added a rate limit.
