Web scraping in Rust

zeroexcuses · September 9, 2022, 11:32am

For web scraping in Rust, is the standard approach reqwest + scraper? If not, what else is recommended?

Simple scraping, no need to simulate a JS runtime / execute JS code.

erelde · September 9, 2022, 12:00pm

That's the combination I've used in the past. (with blocking reqwest, not async)

zeroexcuses · September 9, 2022, 12:35pm

Tutorials I've seen online also use 'blocking reqwest' instead of 'async reqwest'. I find this counterintuitive.

What is the advantage of blocking here?

erelde · September 9, 2022, 12:41pm

Shouldn't the question be the reverse? What's the advantage of async if you're doing simple scraping?

By "simple" I understand scraping a few HTML pages, maybe even only one.
In that case I think you're just better served by parallelism, not concurrency.
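One way to picture "parallelism" for a handful of pages is one OS thread per URL, each making a blocking call. The sketch below uses only the standard library; `fetch_in_parallel` is a hypothetical helper, and its `fetch` parameter stands in for whatever blocking HTTP call you use (for example a closure around a shared `reqwest::blocking::Client`).

```rust
use std::thread;

/// Runs `fetch` for every URL on its own OS thread and returns the results
/// in the same order as the input.
fn fetch_in_parallel<T, F>(urls: &[&str], fetch: F) -> Vec<T>
where
    T: Send,
    F: Fn(&str) -> T + Sync,
{
    let fetch = &fetch;
    thread::scope(|scope| {
        // Spawn all threads first so the calls overlap...
        let handles: Vec<_> = urls
            .iter()
            .map(|&url| scope.spawn(move || fetch(url)))
            .collect();
        // ...then wait for each one in input order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("fetch thread panicked"))
            .collect()
    })
}

fn main() {
    let urls = ["https://example.com/a", "https://example.com/b"];
    // Placeholder "fetch": a real program would perform a blocking HTTP request here.
    let lengths = fetch_in_parallel(&urls, |url| url.len());
    println!("{lengths:?}");
}
```

A thread per URL is only reasonable for a small, known number of pages; it is not a pattern for crawling.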

zeroexcuses · September 9, 2022, 12:46pm

My intuition, which might be wrong, is that scraping is IO heavy, not compute heavy; this seems like a very good match for async.

erelde · September 9, 2022, 12:47pm

It is IO bound. But if you're scraping one page, it doesn't make any difference.

In an async program scraping one page, you'll wait as much as in a sync program scraping the same page.

Roughly, you'll start to get benefits from concurrency when you start scraping more pages than you have CPUs.

zeroexcuses · September 9, 2022, 1:01pm

I think I found the confusion now. I wrote "scraping", but what I need is closer to "crawling + scraping".

erelde · September 9, 2022, 1:04pm

In that case you'll probably get the benefits from async. Just remember to reuse the same reqwest::Client where applicable. That's one pitfall I fall into again and again.

zeroexcuses · September 9, 2022, 9:43pm

What is the benefit of this? Intuitively, we are not starting a Chrome/Firefox session, merely making the computational equivalent of a wget/curl call. What is the overhead of creating a new Client per request?

erelde · September 9, 2022, 10:29pm

An HTTP client is similar to a file handle and subject to the same limits (I believe most HTTP libraries' clients do in fact acquire a network file handle). One process can't hold too many handles (cf. ulimit on Linux).

zeroexcuses · September 9, 2022, 10:37pm

I see, so the Client is more like a ConnectionManager that ensures the number of active TCP connections stays below a certain limit?

quinedot · September 9, 2022, 10:47pm

It uses a connection pool and keep-alive connections as well, to handle multiple requests in a single session. Docs.

zeroexcuses · September 10, 2022, 1:26am

Let's see if I understand this correctly. Suppose we are fetching:

https://foo.bar.com:###/1
https://foo.bar.com:###/2
https://foo.bar.com:###/3
https://foo.bar.com:###/4
https://foo.bar.com:###/5

there is some cost associated with establishing an HTTPS connection to "foo.bar.com:###"; using the same Client lets us reuse that one HTTPS connection to fetch all 5 URLs?

quinedot · September 10, 2022, 1:31am

Correct. That cost is the TCP handshake and the TLS negotiation. It's not reqwest, but see e.g. the output of wget http://google.com/{1..5} for keep-alive reuse.

--xxxx-xx-xx xx:xx:xx--  http://google.com/1
Resolving google.com (google.com)... 
...
--xxxx-xx-xx xx:xx:xx--  http://google.com/2
Reusing existing connection to google.com:80.
...
--xxxx-xx-xx xx:xx:xx--  http://google.com/3
Reusing existing connection to google.com:80.
...
--xxxx-xx-xx xx:xx:xx--  http://google.com/4
Reusing existing connection to google.com:80.
...
--xxxx-xx-xx xx:xx:xx--  http://google.com/5
Reusing existing connection to google.com:80.
