I'm trying to create a scraper program that uses Tokio.
I have two (I think) relevant functions:
pub async fn get_images_of_all_species(links: Vec<String>, num_images_per_species: usize, image_width: u32, image_height: u32) -> Result<Vec<(RgbImage, String)>, reqwest::Error> {
    let title_selector = Selector::parse("title").unwrap();
    let image_selector = Selector::parse("td.node-main-alt > a > img").unwrap();
    let mut map = Vec::<(RgbImage, String)>::new();
    let mut titles = Vec::<String>::new();
    for link in links {
        let title_selector = title_selector.clone();
        let doc = get_text(link).await?;
        let html = Html::parse_document(&doc);
        // get title
        let title_element = html.select(&title_selector).next().unwrap();
        let title_text = title_element
            .text()
            .collect::<Vec<&str>>()
            .join("")
            .split("-")
            .take(1)
            .collect::<String>()
            .trim()
            .replace("Species ", "");
        let num_loops = (num_images_per_species as f64 / 24.0).floor() as usize;
        let remainder = num_images_per_species % 24;
        let mut data = Vec::<(RgbImage, String)>::new();
        // get most images
        for i in 0..num_loops {
            let link = link.clone();
            let title_text = title_text.clone();
            let temp_data = tokio::spawn(async move {
                let from = i * 24;
                let doc = get_text(link + &format!("/bgimage?from={}", from)).await.unwrap();
                let images = get_images_from_page(doc, 24, image_width, image_height).await;
                images
            });
            let new_imgs = temp_data
                .await
                .unwrap()
                .iter()
                .map(|img| (img.to_owned(), title_text))
                .collect();
            data.extend(new_imgs);
        }
    }
    Ok(map)
}
pub async fn get_images_from_page(doc: String, num_images: usize, width: u32, height: u32) -> Vec<RgbImage> {
    let mut images = Vec::<RgbImage>::new();
    let html = Html::parse_document(&doc);
    let image_selector = Selector::parse("td.node-main-alt > a > img").unwrap();
    let mut c = 0;
    let found_images: Vec<_> = html.select(&image_selector).collect();
    for i in 0..found_images.len() {
        let src = found_images[i]
            .value()
            .attrs()
            .collect::<HashMap<&str, &str>>()["src"];
        images.push(get_image(src.to_string(), width, height).await.unwrap());
        c += 1;
        if c > num_images {
            break;
        }
    }
    images
}
I get a lot of errors relating to the temp_data future. Here is the full, somewhat ugly compiler output:
error: future cannot be sent between threads safely
--> src\bg_scraper\mod.rs:45:42
|
45 | let temp_data = tokio::spawn(async move {
| __________________________________________^
46 | | let from = i * 24;
47 | | let doc = get_text(link + &format!("/bgimage?from={}", from)).await.unwrap();
48 | |
... |
51 | | images
52 | | });
| |_____________^ future created by async block is not `Send`
|
= help: within `tendril::tendril::NonAtomic`, the trait `Sync` is not implemented for `Cell<usize>`
note: future is not `Send` as this value is used across an await
--> src\bg_scraper\mod.rs:83:62
|
71 | let html = Html::parse_document(&doc);
| ---- has type `Html` which is not `Send`
...
83 | images.push(get_image(src.to_string(), width, height).await.unwrap());
| ^^^^^^ await occurs here, with `html` maybe used later
...
93 | }
| - `html` is later dropped here
note: required by a bound in `tokio::spawn`
--> C:\Users\Salt lick\.cargo\registry\src\github.com-1ecc6299db9ec823\tokio-1.29.0\src\task\spawn.rs:166:21
|
166 | T: Future + Send + 'static,
| ^^^^ required by this bound in `spawn`
error: future cannot be sent between threads safely
--> src\bg_scraper\mod.rs:45:42
|
45 | let temp_data = tokio::spawn(async move {
| __________________________________________^
46 | | let from = i * 24;
47 | | let doc = get_text(link + &format!("/bgimage?from={}", from)).await.unwrap();
48 | |
... |
51 | | images
52 | | });
| |_____________^ future created by async block is not `Send`
|
= help: within `ego_tree::Node<Node>`, the trait `Sync` is not implemented for `Cell<NonZeroUsize>`
note: future is not `Send` as this value is used across an await
--> src\bg_scraper\mod.rs:83:62
|
75 | let found_images: Vec<_> = html.select(&image_selector).collect();
| ------------ has type `Vec<ElementRef<'_>>` which is not `Send`
...
83 | images.push(get_image(src.to_string(), width, height).await.unwrap());
| ^^^^^^ await occurs here, with `found_images` maybe used later
...
93 | }
| - `found_images` is later dropped here
note: required by a bound in `tokio::spawn`
--> C:\Users\Salt lick\.cargo\registry\src\github.com-1ecc6299db9ec823\tokio-1.29.0\src\task\spawn.rs:166:21
|
166 | T: Future + Send + 'static,
| ^^^^ required by this bound in `spawn`
error: future cannot be sent between threads safely
--> src\bg_scraper\mod.rs:45:42
|
45 | let temp_data = tokio::spawn(async move {
| __________________________________________^
46 | | let from = i * 24;
47 | | let doc = get_text(link + &format!("/bgimage?from={}", from)).await.unwrap();
48 | |
... |
51 | | images
52 | | });
| |_____________^ future created by async block is not `Send`
|
= help: within `ego_tree::Node<Node>`, the trait `Sync` is not implemented for `UnsafeCell<tendril::tendril::Buffer>`
note: future is not `Send` as this value is used across an await
--> src\bg_scraper\mod.rs:83:62
|
75 | let found_images: Vec<_> = html.select(&image_selector).collect();
| ------------ has type `Vec<ElementRef<'_>>` which is not `Send`
...
83 | images.push(get_image(src.to_string(), width, height).await.unwrap());
| ^^^^^^ await occurs here, with `found_images` maybe used later
...
93 | }
| - `found_images` is later dropped here
note: required by a bound in `tokio::spawn`
--> C:\Users\Salt lick\.cargo\registry\src\github.com-1ecc6299db9ec823\tokio-1.29.0\src\task\spawn.rs:166:21
|
166 | T: Future + Send + 'static,
| ^^^^ required by this bound in `spawn`
error: future cannot be sent between threads safely
--> src\bg_scraper\mod.rs:45:42
|
45 | let temp_data = tokio::spawn(async move {
| __________________________________________^
46 | | let from = i * 24;
47 | | let doc = get_text(link + &format!("/bgimage?from={}", from)).await.unwrap();
48 | |
... |
51 | | images
52 | | });
| |_____________^ future created by async block is not `Send`
|
= help: within `ego_tree::Node<Node>`, the trait `Sync` is not implemented for `UnsafeCell<Option<Option<tendril::tendril::Tendril<tendril::fmt::UTF8>>>>`
note: future is not `Send` as this value is used across an await
--> src\bg_scraper\mod.rs:83:62
|
75 | let found_images: Vec<_> = html.select(&image_selector).collect();
| ------------ has type `Vec<ElementRef<'_>>` which is not `Send`
...
83 | images.push(get_image(src.to_string(), width, height).await.unwrap());
| ^^^^^^ await occurs here, with `found_images` maybe used later
...
93 | }
| - `found_images` is later dropped here
note: required by a bound in `tokio::spawn`
--> C:\Users\Salt lick\.cargo\registry\src\github.com-1ecc6299db9ec823\tokio-1.29.0\src\task\spawn.rs:166:21
|
166 | T: Future + Send + 'static,
| ^^^^ required by this bound in `spawn`
error: future cannot be sent between threads safely
--> src\bg_scraper\mod.rs:45:42
|
45 | let temp_data = tokio::spawn(async move {
| __________________________________________^
46 | | let from = i * 24;
47 | | let doc = get_text(link + &format!("/bgimage?from={}", from)).await.unwrap();
48 | |
... |
51 | | images
52 | | });
| |_____________^ future created by async block is not `Send`
|
= help: within `ego_tree::Node<Node>`, the trait `Sync` is not implemented for `UnsafeCell<Option<Vec<string_cache::atom::Atom<markup5ever::LocalNameStaticSet>>>>`
note: future is not `Send` as this value is used across an await
--> src\bg_scraper\mod.rs:83:62
|
75 | let found_images: Vec<_> = html.select(&image_selector).collect();
| ------------ has type `Vec<ElementRef<'_>>` which is not `Send`
...
83 | images.push(get_image(src.to_string(), width, height).await.unwrap());
| ^^^^^^ await occurs here, with `found_images` maybe used later
...
93 | }
| - `found_images` is later dropped here
note: required by a bound in `tokio::spawn`
--> C:\Users\Salt lick\.cargo\registry\src\github.com-1ecc6299db9ec823\tokio-1.29.0\src\task\spawn.rs:166:21
|
166 | T: Future + Send + 'static,
| ^^^^ required by this bound in `spawn`
error: future cannot be sent between threads safely
--> src\bg_scraper\mod.rs:45:42
|
45 | let temp_data = tokio::spawn(async move {
| __________________________________________^
46 | | let from = i * 24;
47 | | let doc = get_text(link + &format!("/bgimage?from={}", from)).await.unwrap();
48 | |
... |
51 | | images
52 | | });
| |_____________^ future created by async block is not `Send`
|
= help: within `ego_tree::Node<Node>`, the trait `Sync` is not implemented for `*mut tendril::fmt::UTF8`
note: future is not `Send` as this value is used across an await
--> src\bg_scraper\mod.rs:83:62
|
75 | let found_images: Vec<_> = html.select(&image_selector).collect();
| ------------ has type `Vec<ElementRef<'_>>` which is not `Send`
...
83 | images.push(get_image(src.to_string(), width, height).await.unwrap());
| ^^^^^^ await occurs here, with `found_images` maybe used later
...
93 | }
| - `found_images` is later dropped here
note: required by a bound in `tokio::spawn`
--> C:\Users\Salt lick\.cargo\registry\src\github.com-1ecc6299db9ec823\tokio-1.29.0\src\task\spawn.rs:166:21
|
166 | T: Future + Send + 'static,
| ^^^^ required by this bound in `spawn`
That is basically the same error message over and over, differing mainly in the help: line. From what I can tell, every one of them points at the same line in get_images_from_page:
images.push(get_image(src.to_string(), width, height).await... => await occurs here, with html and found_images maybe used later
I think (keyword: think) that it's just those two variables, html and found_images, still being alive across the await that are giving me trouble, but I don't know how to actually make it work.
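To check my understanding, here is a stripped-down sketch that uses only std, so it is not my real code. `Rc` stands in for `Html` (neither is `Send`), and `parse`, `fetch`, `require_send`, and `block_on` are made-up helpers: `require_send` imitates the `Send` bound on `tokio::spawn`, and `block_on` is a toy executor just for running the example. If the non-`Send` value is confined to a block that ends before the first `.await`, the future seems to satisfy the bound. I assume the same would hold for get_images_from_page if the src strings were copied into a `Vec<String>` first, but I haven't confirmed that against scraper itself.

```rust
use std::future::Future;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-in for the `Send` bound that `tokio::spawn` requires.
fn require_send<F: Future + Send>(fut: F) -> F {
    fut
}

// Hypothetical stand-in for `get_image`: any async call would do here.
async fn fetch(src: String) -> usize {
    src.len()
}

// `Rc` plays the role of `Html`: it is not `Send`.
fn parse(doc: &str) -> Rc<Vec<String>> {
    Rc::new(doc.split_whitespace().map(str::to_owned).collect())
}

async fn lengths_from_page(doc: String) -> Vec<usize> {
    // Copy out owned `String`s inside a block, so the non-`Send` value
    // is dropped before the first `.await` below.
    let srcs: Vec<String> = {
        let parsed = parse(&doc);
        let owned: Vec<String> = parsed.iter().cloned().collect();
        owned
    };

    let mut out = Vec::new();
    for src in srcs {
        out.push(fetch(src).await);
    }
    out
}

// Toy executor: polls in a loop with a waker that does nothing.
// Good enough here because `fetch` never returns `Poll::Pending`.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn main() {
    // This line only compiles because no `Rc` is held across an `.await`.
    let fut = require_send(lengths_from_page("a.png bb.png".to_string()));
    println!("{:?}", block_on(fut));
}
```

If I move `let parsed = parse(&doc);` out of the inner block so it is still alive during the loop, `require_send` rejects the future with the same kind of "future cannot be sent between threads safely" error as above.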
Asynchronous programming seriously isn't my strength, but it would be horribly inefficient to use blocking here... If anyone knows how to fix this, it would be really helpful!! I'll keep trying to debug it in the meantime.