Below is a complete, simplified version of what I want to do.
I am trying to go through the Google search results page and extract the title, URL and short description (the snippet that shows up on the results page) from each result.
I have mostly succeeded, but one problem remains: the 'description' contains things like <span> and <em> tags and other markup of that kind, which I want to remove.
use reqwest::header::USER_AGENT;
use reqwest::Client;
use select::document::Document;
use select::predicate::*;

#[tokio::main]
async fn main() {
    // Searching for potatoes
    let res = google_search("potatoes".to_string()).await;
    if let Some(results) = res {
        for result in results {
            println!("\n\ndesription: {}\n\n", result.description);
        }
    }
}

// Represents a single search result
pub struct SearchResult {
    pub description: String,
}

const AGENT_STRING: &str =
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:34.0) Gecko/20100101 Firefox/34.0";

pub async fn google_search(search: String) -> Option<Vec<SearchResult>> {
    // Search query
    let request_string = format!(
        "https://www.google.com/search?q={}&gws_rd=ssl&num={}",
        search, 5
    );
    // Sending the request
    if let Ok(body) = Client::new()
        .get(request_string.as_str())
        .header(USER_AGENT, AGENT_STRING)
        .send()
        .await
    {
        if let Ok(text) = body.text().await {
            // The response body as text
            let document = Document::from(text.as_str());
            let mut results: Vec<SearchResult> = Vec::new();
            for node in document.find(
                Attr("id", "search")
                    .descendant(Attr("id", "rso"))
                    .descendant(Class("g"))
                    .descendant(Class("rc"))
                    .descendant(Class("s"))
                    .descendant(Name("span").and(Class("st"))),
            ) {
                let mut description = String::new();
                description.push_str(&node.inner_html());
                results.push(SearchResult { description });
            }
            return Some(results);
        }
    }
    return None;
}
This results in output like:
desription: Baked, roasted, mashed or fried — there's no wrong way to eat <em>potatoes</em>. From hearty meals to healthy sides, get creative with <em>potatoes</em> using these top-notch ...
How can I get rid of this without manually finding and replacing stuff? I would like a solution where I don't need to know any of the present elements and can dynamically omit any tags that show up, and leave only plain text.
In case you want to try to reproduce this, here are the dependencies from my Cargo.toml:
[dependencies]
tokio = {version = "0.2.21", features = ["full"]}
reqwest = "0.10.6"
scraper = "0.12.0"
select = "0.4.3"
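
To make the goal concrete, here is a minimal, standard-library-only sketch of the behaviour I am after. `strip_tags` is a hypothetical helper of my own, not something provided by select or scraper. It assumes that every `<` opens a tag and the next `>` closes it, so it is not a real HTML parser: it does not decode entities such as `&amp;`, and a literal `<` in the text would confuse it. select's `Node` also has a `text()` method next to `inner_html()`, which may well be the proper way to do this; I have not checked how it behaves on this page.

```rust
/// Hypothetical helper (an assumption for illustration, not part of
/// select or scraper): drops everything between '<' and '>' and keeps
/// the remaining characters unchanged.
fn strip_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    // True while we are between a '<' and its closing '>'.
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            // Any '<' is treated as the start of a tag.
            '<' => in_tag = true,
            // A '>' only ends a tag if one is currently open.
            '>' if in_tag => in_tag = false,
            // Outside of tags, copy the character through.
            _ if !in_tag => out.push(c),
            // Inside a tag: discard tag names and attributes.
            _ => {}
        }
    }
    out
}
```

Applied to the sample output above, this would leave the sentence as is but without the <em> and </em> markers around "potatoes". What I am really looking for is the robust equivalent of this that the parsing crate itself provides.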