Scraper help: "<b>foo</b> bar" -> "bar"

  1. Suppose we have an ElementRef that points to:
<b>foo</b> bar
  2. At this point it is very easy to extract the "foo": we just use the selector "b".

  3. How do we extract the "bar"? The problem is that the parent element of "bar" contains the whole of <b>foo</b> bar, i.e. too much info, and there is no tag that separates the bar from the <b>foo</b>.

This can be done by filtering the direct children to find only the text nodes:

/*
[dependencies]
scraper = "0.13.0"
*/

use scraper::{Html, Node, Selector};

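fn main() {
    let html = Html::parse_fragment("<b>foo</b> bar");
    let root = html.root_element();

    // "foo": select the <b> element and take its inner HTML.
    let sel = Selector::parse("b").unwrap();
    let inner: String = root.select(&sel).next().unwrap().inner_html();

    // "bar": walk only the direct children of the root and keep the text nodes,
    // which skips the <b> element and everything inside it.
    let outer: Vec<&str> = root
        .children()
        .filter_map(|node| match node.value() {
            Node::Text(text) => Some(&text[..]),
            _ => None,
        })
        .collect();

    assert_eq!(inner, "foo");
    // The leading space from the source markup is preserved.
    assert_eq!(outer, [" bar"]);
}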

The surrounding whitespace can be stripped with str::trim() or any similar method.
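
As a rough sketch of that last step, the text slices collected into `outer` could be trimmed and any whitespace-only pieces dropped. The helper name `clean_text_nodes` and the sample input are made up for illustration and are not part of scraper; only the standard library is used here.

```rust
/// Hypothetical helper: trim each collected text-node slice and
/// drop the slices that contain nothing but whitespace.
fn clean_text_nodes<'a>(pieces: &[&'a str]) -> Vec<&'a str> {
    pieces
        .iter()
        .map(|&s| s.trim())
        .filter(|s| !s.is_empty())
        .collect()
}

fn main() {
    // Stand-in for the `outer` vector built from the direct text children.
    let outer = vec![" bar", "\n  ", "baz "];
    let cleaned = clean_text_nodes(&outer);
    assert_eq!(cleaned, ["bar", "baz"]);
    println!("{:?}", cleaned);
}
```

Dropping the empty pieces is a choice made for this sketch: markup with line breaks between elements tends to produce text nodes that hold only whitespace, and whether those should be kept depends on what you do with the result.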
