How to scrape the html table contents from a String

The background to this question is not that I am unable to do the task, which is scraping the contents of an HTML table into a struct. I am already using a crate that does this.
The problem is that the Rust compiler warns that the crate uses functionality that is deprecated and might be removed in a future version of Rust:

warning: the following packages contain code that will be rejected by a future version of Rust: cssparser v0.25.9, selectors v0.21.0

These packages are not explicitly listed as dependencies in Cargo.toml, so they must be dependencies of one of the crates that I use. Luckily, this is easy to investigate using cargo tree:

 % cargo tree -i cssparser
cssparser v0.25.9
├── scraper v0.11.0
│   └── table-extract v0.2.2
│       └── yb_stats v0.6.1 (/Users/fritshoogland/code/yb_stats)

And

 % cargo tree -i selectors
selectors v0.21.0
└── scraper v0.11.0
    └── table-extract v0.2.2
        └── yb_stats v0.6.1 (/Users/fritshoogland/code/yb_stats)

Aha! So the deprecated cssparser and selectors crates are pulled in by table-extract (via scraper)!

I am already using the latest version of table-extract, but it hasn't had an update in quite a while (November 2019).

When I manually raise the versions of these dependencies, the crate's own dependency requirements flip them back. So it seems I cannot force a higher version.

I guess the logical next move is to see if I can get the same functionality from another, more recently updated crate. So that is the actual question I have: how can I scrape the data from an HTML table in a very easy way, moving the data row by row into a vector of a predefined struct?

This is my current code using table-extract:

// Parse the first HTML table found in `http_data` into a vector of `Threads`.
fn parse_threads(
    http_data: String
) -> Vec<Threads> {
    let mut threads: Vec<Threads> = Vec::new();
    if let Some(table) = table_extract::Table::find_first(&http_data) {
        // Placeholder used when a row does not have exactly five cells.
        let empty_stack_from_table = String::from("^-^");
        for row in &table {
            // The stack column is read by position rather than by header name.
            let stack_from_table = if row.as_slice().len() == 5 {
                &row.as_slice()[4]
            } else {
                &empty_stack_from_table
            };
            // The other columns are looked up by header text.
            threads.push(Threads {
                thread_name: row.get("Thread name").unwrap_or("<Missing>").to_string(),
                cumulative_user_cpu_s: row.get("Cumulative User CPU(s)").unwrap_or("<Missing>").to_string(),
                cumulative_kernel_cpu_s: row.get("Cumulative Kernel CPU(s)").unwrap_or("<Missing>").to_string(),
                cumulative_iowait_cpu_s: row.get("Cumulative IO-wait(s)").unwrap_or("<Missing>").to_string(),
                stack: stack_from_table.to_string()
            });
        }
    }
    threads
}

A String containing the HTML is passed to the function, which uses table_extract to parse the first table it encounters. This is not ideal: it works in a very basic way, only parses <tr>,</tr> as a row and <td>,</td> as fields, and doesn't parse some <tr> data (<tr data-depth="3" class="level3"> is not found), but it does the job.
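
The positional fallback for the stack column does not depend on any HTML crate, so it can be looked at in isolation. A minimal sketch, assuming each row has already been reduced to a slice of cell strings; the helper name `stack_cell` and the sample values are hypothetical, and the "^-^" placeholder is the one from the code above:

```rust
/// Placeholder used when a row carries no stack column (same value as in the code above).
const EMPTY_STACK: &str = "^-^";

/// Hypothetical helper: return the fifth cell when the row has exactly
/// five cells, otherwise the placeholder.
fn stack_cell(cells: &[String]) -> &str {
    if cells.len() == 5 {
        &cells[4]
    } else {
        EMPTY_STACK
    }
}

fn main() {
    // Made-up sample row with five cells.
    let full: Vec<String> = ["main", "0.1", "0.2", "0.0", "frame_a"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    // The same row without the fifth cell.
    let short: Vec<String> = full[..4].to_vec();

    println!("{}", stack_cell(&full));
    println!("{}", stack_cell(&short));
}
```

Keeping the check as `len() == 5` mirrors the original behavior; a row with more than five cells also gets the placeholder.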

So the question is: are there equally simple ways to do the same with more modern crates?

How about html5ever from the Servo project? It should be both good and up to date. And it's a full-scale HTML parser, so you can extract all the data you need.

The problematic table-extract crate depends on an old version of the scraper crate, which is an actively maintained interface for html5ever. One solution would be to look at the source of table-extract and reimplement the needed functionality with scraper.

I have searched and read about html5ever, but I have been unable to find a practical example of how to use it to extract table data. Almost all examples only show how it parses HTML and creates a tree.

Thank you for your answer. I am just learning Rust, so I neither feel comfortable reimplementing a crate nor have the knowledge to do so.

With this version, you can replace table-extract with scraper: Rust Playground. The behavior should be identical to what table-extract was doing, but you should test it on your own tables to be certain. In particular, table-extract uses a HashMap<String, usize> to track which headers are at which positions, but I moved the logic into parse_threads. Feel free to ask any questions about the code.
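
The header bookkeeping mentioned above is independent of the HTML parser. A minimal sketch, assuming the header row and the data rows are already available as vectors of strings; `header_positions` and `cell` are hypothetical names (not part of table-extract or scraper), and the sample values are made up:

```rust
use std::collections::HashMap;

/// Map each header text to its column index.
fn header_positions(headers: &[String]) -> HashMap<String, usize> {
    headers
        .iter()
        .enumerate()
        .map(|(index, name)| (name.clone(), index))
        .collect()
}

/// Look up a cell by header name, falling back to "<Missing>" when the
/// header is unknown or the row is too short.
fn cell<'a>(row: &'a [String], positions: &HashMap<String, usize>, header: &str) -> &'a str {
    positions
        .get(header)
        .and_then(|&index| row.get(index))
        .map(String::as_str)
        .unwrap_or("<Missing>")
}

fn main() {
    let headers = vec![
        "Thread name".to_string(),
        "Cumulative User CPU(s)".to_string(),
    ];
    // Made-up sample row.
    let row = vec!["rpc_worker".to_string(), "1.230s".to_string()];

    let positions = header_positions(&headers);
    println!("{}", cell(&row, &positions, "Thread name"));
    println!("{}", cell(&row, &positions, "Cumulative IO-wait(s)"));
}
```

Building the map once per table keeps each per-row lookup a single hash lookup plus an index into the row.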

Yes, so the tree has nodes with the node type in it, right? A table corresponds to a <table> node. So you have to find the <table> node, and extract the data from the sub tree rooted at that node.
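
The find-the-node-then-walk-its-subtree idea can be sketched without any parser. This uses a toy `Node` type as a stand-in for a parsed tree; it is an assumption for illustration and not html5ever's actual types or API, and all function names are hypothetical:

```rust
/// Toy DOM node; a stand-in for a real parser's tree, not html5ever's types.
enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Depth-first search for the first element with the given tag name.
fn find_first<'a>(node: &'a Node, tag: &str) -> Option<&'a Node> {
    match node {
        Node::Element { name, children } => {
            if name.as_str() == tag {
                return Some(node);
            }
            children.iter().find_map(|child| find_first(child, tag))
        }
        Node::Text(_) => None,
    }
}

/// Concatenate all text found below a node.
fn text_of(node: &Node) -> String {
    match node {
        Node::Text(text) => text.clone(),
        Node::Element { children, .. } => children.iter().map(text_of).collect(),
    }
}

/// Walk the subtree and push one Vec of `td` texts per `tr` element.
fn collect_rows(node: &Node, rows: &mut Vec<Vec<String>>) {
    if let Node::Element { name, children } = node {
        if name.as_str() == "tr" {
            let cells: Vec<String> = children
                .iter()
                .filter(|c| matches!(c, Node::Element { name, .. } if name.as_str() == "td"))
                .map(text_of)
                .collect();
            rows.push(cells);
        } else {
            for child in children {
                collect_rows(child, rows);
            }
        }
    }
}

/// Collect the rows of the subtree rooted at `table`.
fn rows_of(table: &Node) -> Vec<Vec<String>> {
    let mut rows = Vec::new();
    collect_rows(table, &mut rows);
    rows
}

/// Small constructors to keep the example tree readable.
fn el(name: &str, children: Vec<Node>) -> Node {
    Node::Element { name: name.to_string(), children }
}

fn txt(s: &str) -> Node {
    Node::Text(s.to_string())
}

fn main() {
    // html > body > table > tr > td, td
    let tree = el("html", vec![el("body", vec![el("table", vec![
        el("tr", vec![el("td", vec![txt("a")]), el("td", vec![txt("b")])]),
    ])])]);

    if let Some(table) = find_first(&tree, "table") {
        for row in rows_of(table) {
            println!("{:?}", row);
        }
    }
}
```

A real parser's tree carries attributes and more node kinds, but the shape of the traversal stays the same: locate the `<table>` node first, then only look at its descendants.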

Thank you so much for your effort. I implemented it, and it works like a charm!
