Web scraping in Rust

For web scraping in Rust, is the standard approach reqwest + scraper? If not, what else is recommended?

Simple scraping, no need to simulate a JS runtime / execute JS code.

That's the combination I've used in the past (with blocking reqwest, not async).

Tutorials I've seen online also use blocking reqwest instead of async reqwest. I find this counterintuitive.

What is the advantage of blocking here?

Shouldn't the question be the reverse? What's the advantage of async if you're doing simple scraping?

By "simple" here I understand: scraping a few HTML pages, maybe even only one.
In that case I think you're just better served by parallelism, not concurrency.

My intuition, which might be wrong, is that scraping is IO heavy, not compute heavy; this seems like a very good match for async.

It is IO-bound. But if you're scraping one page, it doesn't make any difference.

In an async program scraping one page, you'll wait just as long as in a sync program scraping the same page.

Roughly, you'll start to get benefits from concurrency when you start scraping more pages than you have CPUs.
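
To make the point about waiting concrete, here is a minimal sketch that uses only the standard library. It is not reqwest and not async: `fake_fetch` is a hypothetical stand-in that sleeps instead of doing a real request, and scoped threads stand in for concurrent tasks. The URLs and the 100 ms latency are made up for illustration.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for a blocking HTTP GET: sleeps to mimic network latency.
fn fake_fetch(url: &str, latency: Duration) -> String {
    thread::sleep(latency);
    format!("<html>{url}</html>")
}

// Fetch pages one after another: the waits add up.
fn fetch_sequential(urls: &[&str], latency: Duration) -> Vec<String> {
    urls.iter().map(|u| fake_fetch(u, latency)).collect()
}

// Fetch each page on its own scoped thread: the waits overlap.
fn fetch_concurrent(urls: &[&str], latency: Duration) -> Vec<String> {
    thread::scope(|s| {
        let handles: Vec<_> = urls
            .iter()
            .map(|u| s.spawn(move || fake_fetch(u, latency)))
            .collect();
        // Join in spawn order so results line up with the input URLs.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let urls = ["/1", "/2", "/3", "/4", "/5"];
    let latency = Duration::from_millis(100);

    let t = Instant::now();
    let a = fetch_sequential(&urls, latency);
    println!("sequential: {:?}", t.elapsed());

    let t = Instant::now();
    let b = fetch_concurrent(&urls, latency);
    println!("concurrent: {:?}", t.elapsed());

    // Same pages either way; only the total waiting time differs.
    assert_eq!(a, b);
}
```

With a single URL in the list, both functions wait for about one latency, which is the "one page makes no difference" case. The gap only opens up as the list grows.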

I think I found the confusion now. I wrote "scraping", but what I need is closer to "crawling + scraping".

In that case you'll probably get the benefits from async. Just remember to reuse the same reqwest::Client where applicable. That's one pitfall I fall into again and again.

What is the benefit of this? Intuitively, we are not starting a Chrome/Firefox session, merely making the computational equivalent of a wget/curl call. What is the overhead of creating a new Client per request?

An HTTP client is similar to a file handle, and subject to the same limits (I believe most HTTP libraries' clients will in fact acquire a network file handle). One process can't have too many handles (cf. ulimit on Linux).

I see, so the Client is more like a ConnectionManager and ensures that the number of active TCP connections stays below a certain limit?

It uses a connection pool and keep-alive connections as well, to handle multiple requests in a single session. See the reqwest::Client docs.

Let's see if I understand this correctly. Suppose we are fetching:

https://foo.bar.com:###/1
https://foo.bar.com:###/2
https://foo.bar.com:###/3
https://foo.bar.com:###/4
https://foo.bar.com:###/5

There is some cost associated with establishing an HTTPS connection to "foo.bar.com:###"; using the same Client allows us to reuse this one HTTPS connection and fetch all 5 URLs?

Correct: the cost is the TCP handshake and the TLS negotiation. It's not reqwest, but see e.g. the output of wget http://google.com/{1..5} for keep-alive reuse.

--xxxx-xx-xx xx:xx:xx--  http://google.com/1
Resolving google.com (google.com)... 
...
--xxxx-xx-xx xx:xx:xx--  http://google.com/2
Reusing existing connection to google.com:80.
...
--xxxx-xx-xx xx:xx:xx--  http://google.com/3
Reusing existing connection to google.com:80.
...
--xxxx-xx-xx xx:xx:xx--  http://google.com/4
Reusing existing connection to google.com:80.
...
--xxxx-xx-xx xx:xx:xx--  http://google.com/5
Reusing existing connection to google.com:80.
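
The same effect can be reproduced locally with a minimal sketch that uses only the standard library. This is not reqwest and there is no TLS here: the toy server, the `spawn_server`, `get`, `fetch_reusing` and `fetch_fresh` names, and the "page /N" bodies are all assumptions made up for the demo. The server counts accepted TCP connections, so you can see how many handshakes each strategy costs.

```rust
use std::io::{BufRead, BufReader, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Toy HTTP/1.1 server (demo only): serves requests on a connection until the
// peer closes it, and counts every accepted TCP connection.
fn spawn_server(accepted: Arc<AtomicUsize>) -> std::io::Result<String> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?.to_string();
    thread::spawn(move || {
        for stream in listener.incoming().flatten() {
            accepted.fetch_add(1, Ordering::SeqCst);
            let mut conn = BufReader::new(stream);
            let mut line = String::new();
            loop {
                line.clear();
                // Reading 0 bytes means the client closed the connection.
                if conn.read_line(&mut line).unwrap_or(0) == 0 { break; }
                let path = line.split_whitespace().nth(1).unwrap_or("/").to_string();
                // Skip the remaining request headers up to the blank line.
                while line != "\r\n" {
                    line.clear();
                    if conn.read_line(&mut line).unwrap_or(0) == 0 { break; }
                }
                let body = format!("page {path}");
                let resp = format!("HTTP/1.1 200 OK\r\nContent-Length: {}\r\n\r\n{}", body.len(), body);
                if conn.get_mut().write_all(resp.as_bytes()).is_err() { break; }
            }
        }
    });
    Ok(addr)
}

// Send one GET on an already-open connection and read exactly one response.
fn get(conn: &mut BufReader<TcpStream>, path: &str) -> std::io::Result<String> {
    let req = format!("GET {path} HTTP/1.1\r\nHost: localhost\r\n\r\n");
    conn.get_mut().write_all(req.as_bytes())?;
    let mut content_length = 0usize;
    let mut line = String::new();
    loop {
        line.clear();
        conn.read_line(&mut line)?;
        if line == "\r\n" || line.is_empty() { break; }
        if let Some(v) = line.strip_prefix("Content-Length: ") {
            content_length = v.trim().parse().unwrap_or(0);
        }
    }
    let mut body = vec![0u8; content_length];
    conn.read_exact(&mut body)?;
    Ok(String::from_utf8_lossy(&body).into_owned())
}

// One connection for all paths: the connection setup is paid once.
fn fetch_reusing(addr: &str, paths: &[&str]) -> std::io::Result<Vec<String>> {
    let mut conn = BufReader::new(TcpStream::connect(addr)?);
    paths.iter().map(|p| get(&mut conn, p)).collect()
}

// A fresh connection per path: the connection setup is paid every time.
fn fetch_fresh(addr: &str, paths: &[&str]) -> std::io::Result<Vec<String>> {
    paths
        .iter()
        .map(|p| get(&mut BufReader::new(TcpStream::connect(addr)?), p))
        .collect()
}

fn main() -> std::io::Result<()> {
    let accepted = Arc::new(AtomicUsize::new(0));
    let addr = spawn_server(Arc::clone(&accepted))?;
    let paths = ["/1", "/2", "/3", "/4", "/5"];

    let pages = fetch_reusing(&addr, &paths)?;
    println!("reused: {} pages, {} connection(s) so far", pages.len(), accepted.load(Ordering::SeqCst));

    let pages = fetch_fresh(&addr, &paths)?;
    println!("fresh: {} pages, {} connections in total", pages.len(), accepted.load(Ordering::SeqCst));
    Ok(())
}
```

`fetch_reusing` is roughly what a shared Client's connection pool gives you for repeated requests to one host, while `fetch_fresh` is closer to what happens when a new client is built for every request. With HTTPS, each fresh connection would also repeat the TLS negotiation.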