Threadsafe HTML parsing

I am working on a GUI for web scraping that detects videos on a website, and for that I need to parse HTML pages in the background. I have already written the parsing code using the select crate and am now trying to integrate it with the GUI, for which I am using the iced crate. The relevant code looks roughly like this:

iced::Command::perform(async move {
    // perform the webscraping
    detect_videos(&username, &password).await.unwrap()
}, Message::VideosFound)

My problem at this point is that iced::Command::perform() expects the future to be Send. However, within the future I use the non-Send type select::document::Document for storing the HTML DOM, and therefore the code does not compile. I found the same problem with HTML parsing crates other than select: e.g. the types scraper::html::Html and crabquery::Document are also non-Send. I believe the underlying problem is that all those libraries rely on the non-Send Tendril type from html5ever.

I am looking for tips about either an HTML parsing library that implements a Send type for storing the DOM, or other ways the problem can be circumvented. I might be missing something because I'm relatively new to async. Any help would be appreciated!

Generally, async code only requires that values held across an .await are Send. If you write your code such that the non-Send values are used only in non-async functions, and then call those non-async functions from your async code, you won't get that error.
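
A minimal sketch of that pattern, using only the standard library. The names here are assumptions for illustration: Dom is a hypothetical stand-in for a non-Send DOM type such as select::document::Document (it is non-Send because it holds an Rc), fetch_page is a placeholder for the real network request, and require_send mimics the Send bound that iced::Command::perform() places on the future.

```rust
use std::future::Future;
use std::rc::Rc;

// Hypothetical stand-in for a non-Send DOM type; the Rc makes it non-Send.
struct Dom {
    text: Rc<String>,
}

// Synchronous helper: the non-Send Dom is created and dropped inside this
// function, so it never lives across an .await.
fn count_videos(html: &str) -> usize {
    let dom = Dom { text: Rc::new(html.to_owned()) };
    dom.text.matches("<video").count()
}

// Placeholder for the real network request (assumption for illustration).
async fn fetch_page() -> String {
    String::from("<html><video src=\"a.mp4\"></video></html>")
}

// Only Send values (the String) are held across the .await here.
async fn detect_videos() -> usize {
    let html = fetch_page().await;
    count_videos(&html)
}

// Compile-time check: this function accepts only Send futures.
fn require_send<F: Future + Send>(future: F) -> F {
    future
}
```

Passing detect_videos() to require_send compiles because the Dom exists only inside the synchronous count_videos call. Creating the Dom directly in detect_videos and keeping it alive across fetch_page().await would make the future non-Send again.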

Thanks, that worked for me! I wrapped all the computations with non-Send data structures into scopes to enforce them being dropped before the next .await. This way the code compiles.
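
For illustration, the scoped variant might look roughly like the sketch below. It uses only the standard library: an Rc<String> stands in for the non-Send DOM type, next_step is a hypothetical placeholder for whatever is awaited afterwards, require_send mimics the Send bound of iced::Command::perform(), and block_on is a toy busy-polling executor included only so the example can be run without an async runtime.

```rust
use std::future::Future;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical placeholder for whatever is awaited after parsing.
async fn next_step() {}

async fn scrape(html: String) -> usize {
    let count = {
        // Non-Send stand-in for the DOM; it lives only inside this block.
        let dom = Rc::new(html);
        let n = dom.matches("<video").count();
        n
    }; // `dom` is dropped here, before the next .await

    next_step().await;
    count
}

// Compile-time check: this function accepts only Send futures.
fn require_send<F: Future + Send>(future: F) -> F {
    future
}

// Toy executor for this example only: polls the future in a loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(future: F) -> F::Output {
    let mut future = Box::pin(future);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut cx) {
            return output;
        }
    }
}
```

In a real application the future would be handed to the GUI framework or an async runtime instead of block_on; the inner block is the only part that matters for the Send requirement.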
