This is almost certainly just me missing something, but, well, my async doesn't seem to actually be async. I'm using Tokio on a project that talks to the web by spawning 10 tasks, each running a function I wrote.
This is my main function:
use tokio::{self, task::JoinHandle};
use futures::future::join_all;
use lookup::*;

#[tokio::main]
async fn main() {
    let mut handles = Vec::<JoinHandle<()>>::new();
    for i in 0..NUM_BATCHES {
        handles.push(tokio::spawn(async move {
            generate_training_data(BATCH_SIZE, format!("./data/data_{}.csv", i))
                .await
                .unwrap();
        }));
    }
    join_all(handles).await;
}
I have it set up so that each task shows its own progress bar. When I run the program, only the last progress bar moves; when it finishes, the other bars never advance, and only the first file batch is saved.
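From the indicatif docs, it sounds like independent ProgressBars all fight over the same terminal line unless they're registered with a MultiProgress, which would at least explain why only one bar appears to move. A rough sketch of the restructuring I have in mind; the extra bar parameter on generate_training_data is my own hypothetical change:

use indicatif::{MultiProgress, ProgressBar};

// One MultiProgress owns the terminal; each task gets its own bar from it.
let multi = MultiProgress::new();
for i in 0..NUM_BATCHES {
    let bar = multi.add(ProgressBar::new(BATCH_SIZE as u64));
    handles.push(tokio::spawn(async move {
        // Hypothetical: generate_training_data would take the bar as an
        // argument instead of creating one internally.
        generate_training_data(BATCH_SIZE, format!("./data/data_{}.csv", i), bar)
            .await
            .unwrap();
    }));
}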
The generate_training_data function:
use scraper::{Html, Selector};
use reqwest;
use regex::Regex;
use indicatif::ProgressBar;
use serde::{Deserialize, Serialize};
use csv;

pub async fn generate_training_data(num_samples: usize, output_path: String) -> std::io::Result<()> {
    let mut writer = csv::Writer::from_path(output_path).unwrap();
    let bar = ProgressBar::new(num_samples as u64);
    let example_selector =
        Selector::parse("div.samples-list > article > div.samples-list__item__content").unwrap();
    let title_selector = Selector::parse("title").unwrap();
    let re = Regex::new(r#" \(.*\)"#).unwrap();
    for _ in 0..num_samples {
        match reqwest::get(RANDOM_URL).await {
            Ok(page) => {
                match page.text().await {
                    Ok(doc) => {
                        let html = Html::parse_document(&doc);
                        let full_title = html
                            .select(&title_selector)
                            .last()
                            .unwrap()
                            .text()
                            .collect::<String>();
                        let split: Vec<&str> = full_title.split(" - ").collect();
                        let acronym = split[0].to_string();
                        let full_def = split[1].replace(" | AcronymFinder", "");
                        let def = re.replace(&full_def, "").to_string();
                        for item in html.select(&example_selector) {
                            let text = re.replace(&item.text().collect::<String>(), "").to_string();
                            let data = Data { text, abbr: acronym.clone(), definition: def.clone() };
                            match writer.serialize(data) {
                                Ok(_) => {}
                                Err(e) => {
                                    println!("{e}");
                                }
                            };
                        }
                    }
                    Err(e) => {
                        println!("WARNING: {e}");
                    }
                }
            }
            Err(e) => {
                println!("WARNING: {e}");
            }
        }
        bar.inc(1);
    }
    bar.finish_with_message("Done!");
    writer.flush().unwrap();
    Ok(())
}
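For completeness, the Data struct isn't in the snippet above; it's just a plain serde record along these lines (the exact derives are from memory):

// Field names and types follow from how Data is used in the loop above.
#[derive(Serialize, Deserialize)]
pub struct Data {
    pub text: String,
    pub abbr: String,
    pub definition: String,
}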
Async is my enemy, but definitely necessary for a web scraper.
A little more debugging I did: I tried changing the join_all to try_join_all and matching the result, and it returned an Ok value. Not really sure what that tells me.
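For reference, this is roughly what that looked like; try_join_all comes from futures::future, and the Err arm carries the JoinError of the first task that fails:

use futures::future::try_join_all;

match try_join_all(handles).await {
    // Ok means every JoinHandle resolved without panicking or being cancelled.
    Ok(_) => println!("every task joined cleanly"),
    Err(e) => println!("a task panicked or was cancelled: {e}"),
}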
Ok, edit: turns out I just needed a let binding. Dumb question, like I said. Now it's giving me a different error, though... Not what I originally posted, I know, but now I get error trying to connect: dns error: task _ was cancelled. Love async!!
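In case it helps anyone else: from what I can tell, reqwest::get builds a throwaway Client for every single request, and the cancelled-DNS-task error can apparently show up when those internals get torn down while requests are still in flight. My next attempt is to share one Client across the whole loop (reqwest's Client is reference-counted internally, so reusing it is cheap); a sketch, with the parsing elided:

let client = reqwest::Client::new();
for _ in 0..num_samples {
    // Reuses one connection pool and resolver instead of rebuilding
    // them on every iteration.
    match client.get(RANDOM_URL).send().await {
        Ok(page) => { /* same title/example parsing as before */ }
        Err(e) => println!("WARNING: {e}"),
    }
    bar.inc(1);
}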