Removing HTML Tags from a String obtained from Select Crate

I have posted the entirety of a simplified version of what I want to do below.

I am trying to go through the Google search results page and extract the title, URL and short description (the snippet shown under each result) from the results.

I have mostly succeeded, but one problem remains: the 'description' contains things like <span> and <em> tags and &nbsp; entities, which I want to remove.

use reqwest::header::USER_AGENT;
use reqwest::Client;

use select::document::Document;
use select::predicate::*;

#[tokio::main]
async fn main() {
    //Searching for potatoes
    let res = google_search("potatoes".to_string()).await;
    if let Some(results) = res {
        for result in results {
            println!("\n\ndesription: {}\n\n", result.description);
        }
    }
}

//Represents a single search result
pub struct SearchResult {
    pub description: String,
}

const AGENT_STRING: &str =
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:34.0) Gecko/20100101 Firefox/34.0";
pub async fn google_search(search: String) -> Option<Vec<SearchResult>> {
    //Search Query
    let request_string = format!(
        "https://www.google.com/search?q={}&gws_rd=ssl&num={}",
        search, 5
    );

    //Sending The Request
    if let Ok(body) = Client::new()
        .get(request_string.as_str())
        .header(USER_AGENT, AGENT_STRING)
        .send()
        .await
    {
        if let Ok(text) = body.text().await {
            //The Response Body as Text
            let document = Document::from(text.as_str());
            let mut results: Vec<SearchResult> = Vec::new();
            for node in document.find(
                Attr("id", "search")
                    .descendant(Attr("id", "rso"))
                    .descendant(Class("g"))
                    .descendant(Class("rc"))
                    .descendant(Class("s"))
                    .descendant(Name("span").and(Class("st"))),
            ) {
                let mut description = String::new();
                description.push_str(&node.inner_html());

                results.push(SearchResult { description });
            }
            return Some(results);
        }
    }
    return None;
}

This results in output like:

desription: Baked, roasted, mashed or fried — there's no wrong way to eat <em>potatoes</em>. From hearty meals to healthy sides, get creative with <em>potatoes</em> using these top-notch&nbsp;...

How can I get rid of this without manually finding and replacing stuff? I would like a solution where I don't need to know any of the present elements and can dynamically omit any tags that show up, and leave only plain text.

In case you want to try to reproduce this, these are my dependencies:

[dependencies]
tokio = {version = "0.2.21", features = ["full"]}
reqwest = "0.10.6"
scraper = "0.12.0"
select = "0.4.3"

Looks like parse_fragment from the scraper crate does the job.

fn main() {
    let frag = scraper::Html::parse_fragment("Baked, roasted, mashed or fried — there's no wrong way to eat <em>potatoes</em>. From hearty meals to healthy sides, get creative with <em>potatoes</em> using these top-notch&nbsp;");
    for node in frag.tree {
        if let scraper::node::Node::Text(text) = node {
            print!("{}", text.text);
        }
    }
}

outputs "Baked, roasted, mashed or fried — there's no wrong way to eat potatoes. From hearty meals to healthy sides, get creative with potatoes using these top-notch"
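
For comparison, here is a rough dependency-free sketch of the same idea, in case pulling in a second HTML parser is not wanted. The function names strip_tags and decode_entities are made up for this example and are not part of select or scraper. It assumes well-formed snippets where a literal '<' only ever opens a tag, and it decodes only a handful of common entities, so a real parser such as the one above is still the more robust choice.

```rust
/// Hypothetical helper: decodes a few common HTML entities.
/// `&amp;` is handled last so that text like `&amp;lt;` is decoded only once.
fn decode_entities(text: &str) -> String {
    text.replace("&nbsp;", "\u{a0}")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&amp;", "&")
}

/// Hypothetical helper: drops everything between '<' and '>' without
/// knowing which tags are present, then decodes the remaining entities.
fn strip_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,            // a tag starts here
            '>' if in_tag => in_tag = false, // the tag ends here
            _ if !in_tag => out.push(c),     // plain text outside any tag
            _ => {}                          // characters inside a tag are skipped
        }
    }
    decode_entities(&out)
}

fn main() {
    let snippet = "no wrong way to eat <em>potatoes</em>. Get creative with <em>potatoes</em> using these top-notch&nbsp;...";
    println!("{}", strip_tags(snippet));
}
```

In the original code this would replace the inner_html() call with something like strip_tags(&node.inner_html()). As far as I remember, select's Node also has a text() method that returns only the text content, which may make the extra step unnecessary, but I have not checked that against version 0.4.3.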

That worked :blush:
Thanks a lot for taking the time to read through all of that and helping me out!
