Web scraper/archiver - crate recommendations?

Was discussing this on Discord and thought I'd keep it going in an easier-to-search format.

Let's say I wanted to build a Wayback Machine-style archive like archive.org, or a multi-domain end-to-end testing system. What would be needed for a high-performance variant?

Ideas (just crates.io links):

  • Parse HTML
  • Download (basic HTTP requests)
  • Download (gather associated CSS, JS, HTML but no JS rendering / AJAX)
  • Download (CSS, JS, HTML, JS rendering, AJAX, screenshots)
  • Storage
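
For concreteness, here's the level of "parse HTML" + "basic HTTP requests" I have in mind. The crate picks (reqwest and scraper) are just placeholders for the discussion, not a conclusion:

```rust
// Rough sketch: fetch one page and pull out its links, i.e. the seed of a
// crawl frontier. Assumes reqwest (with the "blocking" feature) and scraper.
use scraper::{Html, Selector};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A blocking request keeps the example short; a real crawler would be async.
    let body = reqwest::blocking::get("https://example.com/")?.text()?;

    // Parse the document and collect every href as a crawl candidate.
    let document = Html::parse_document(&body);
    let links = Selector::parse("a[href]").unwrap();
    for a in document.select(&links) {
        if let Some(href) = a.value().attr("href") {
            println!("{href}");
        }
    }
    Ok(())
}
```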


I think the biggest component you are missing for a Wayback Machine is some sort of job queue or higher-level scheduler.

Having a scheduler component gives you nice things like the ability to resume crawling, track which pages have been seen and when, scale out to multiple workers, re-schedule pages that 500'd spuriously or failed to connect, retry when a worker stops responding, and so on.

To that end, I would probably add the following:

  • sqlx - for interacting with a database
  • tonic - for communicating between the master node and workers
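
To sketch what I mean by the scheduler: a worker claims one pending page from a shared queue, fetches it, and reports back. Everything below (table name, columns, the claim query) is made up for illustration; the only real suggestion here is sqlx itself:

```rust
// Sketch of a Postgres-backed job queue using sqlx. The `crawl_jobs` table and
// its columns are hypothetical; the claim query uses the standard
// FOR UPDATE SKIP LOCKED pattern so multiple workers can share the queue.
use sqlx::postgres::PgPool;

#[derive(sqlx::FromRow, Debug)]
struct CrawlJob {
    id: i64,
    url: String,
}

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPool::connect("postgres://localhost/archive").await?;

    // Atomically claim one pending job; other workers skip the locked row.
    let job = sqlx::query_as::<_, CrawlJob>(
        "UPDATE crawl_jobs
         SET state = 'in_progress', claimed_at = now()
         WHERE id = (
             SELECT id FROM crawl_jobs
             WHERE state = 'pending'
             ORDER BY id
             FOR UPDATE SKIP LOCKED
             LIMIT 1
         )
         RETURNING id, url",
    )
    .fetch_optional(&pool)
    .await?;

    if let Some(job) = job {
        println!("claimed job {}: {}", job.id, job.url);
        // ...fetch the page here, then mark the row done or re-queue it on failure...
    }
    Ok(())
}
```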

You may also want to reconsider using browser automation for viewing pages. Driving a browser to view pages in bulk will be super slow and memory-intensive, and it means you can't really scrape pages concurrently. This is what the Selenium docs say:

Link spidering

Using WebDriver to spider through links is not a recommended practice. Not because it cannot be done, but because WebDriver is definitely not the most ideal tool for this. WebDriver needs time to start up, and can take several seconds, up to a minute depending on how your test is written, just to get to the page and traverse through the DOM.

Instead of using WebDriver for this, you could save a ton of time by executing a curl command, or using a library such as BeautifulSoup since these methods do not rely on creating a browser and navigating to a page. You are saving tonnes of time by not using WebDriver for this task.
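
For contrast, the plain-HTTP path parallelizes trivially. A rough sketch of bulk fetching with reqwest + tokio + futures (my crate choices, nothing the thread has settled on):

```rust
// Fetch many pages concurrently without any browser. buffer_unordered caps the
// number of requests in flight, which is the knob a crawler actually needs.
use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let urls = vec!["https://example.com/", "https://example.org/"];

    let bodies: Vec<Result<String, reqwest::Error>> = stream::iter(urls)
        .map(|url| {
            let client = client.clone();
            async move { client.get(url).send().await?.text().await }
        })
        .buffer_unordered(16) // up to 16 pages in flight at once
        .collect()
        .await;

    for body in bodies {
        match body {
            Ok(html) => println!("fetched {} bytes", html.len()),
            Err(e) => eprintln!("request failed: {e}"),
        }
    }
}
```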

Let's also throw in this crate for good measure:

  • robotstxt - so you can respect a site's robots.txt file
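
And a quick sketch of where that check would sit, going by my memory of the robotstxt crate's README (double-check the matcher type and method name against the current docs):

```rust
// Gate URLs on robots.txt before they go into the queue. The matcher API here
// is from memory of the robotstxt crate (a port of Google's C++ parser).
use robotstxt::DefaultMatcher;

fn main() {
    // In a real crawler this body comes from fetching https://<host>/robots.txt.
    let robots_body = "User-agent: *\nDisallow: /private/\n";

    let mut matcher = DefaultMatcher::default();
    let allowed = matcher.one_agent_allowed_by_robots(
        robots_body,
        "MyArchiver",               // hypothetical user-agent name
        "https://example.com/page", // the URL we want to enqueue
    );
    println!("allowed to crawl: {allowed}");
}
```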

I agree; using a rendering system like PhantomJS or one of the headless modes of modern browsers will dramatically slow everything down.

I just don't know any other way of:

  • Screenshot capture of website
  • Clicking
    • Accept / I agree (GDPR);
    • Close notification/newsletter modal

Obviously the regular crawl method can give me the assets (JS, CSS, images, fonts), but getting the screenshot would be handy, and so many newer websites are so heavy on JS that it's almost impossible to browse them with JavaScript disabled.
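
For reference, this is the sort of flow I mean, sketched with fantoccini (a WebDriver client crate). It still drives a real browser, so the performance caveat above applies; the idea would be to reserve it for pages that actually need rendering. The locally running WebDriver server on port 4444 and the consent-button selector are assumptions:

```rust
// Sketch: load a page in a WebDriver-controlled browser, click away a consent
// dialog if present, and save a screenshot. Assumes geckodriver/chromedriver is
// already listening on localhost:4444.
use fantoccini::{ClientBuilder, Locator};

#[tokio::main]
async fn main() -> Result<(), fantoccini::error::CmdError> {
    let mut client = ClientBuilder::native()
        .connect("http://localhost:4444")
        .await
        .expect("failed to connect to WebDriver");

    client.goto("https://example.com/").await?;

    // Dismiss a GDPR / cookie-consent dialog if one shows up (selector is made up).
    if let Ok(button) = client.find(Locator::Css("button.accept-cookies")).await {
        button.click().await?;
    }

    // Capture the rendered page as PNG bytes and write them out.
    let png = client.screenshot().await?;
    std::fs::write("example.png", png).expect("failed to write screenshot");

    client.close().await?;
    Ok(())
}
```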


Good suggestions on the queue and tonic sides. Thanks


There is a JS/TS runtime written in Rust (https://deno.com/), but I'm not sure whether the JS part is usable from Rust.
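
For what it's worth, the core of Deno is published separately as the deno_core crate, and its JsRuntime can be embedded in a Rust program. A rough sketch follows; the execute_script signature has shifted across deno_core versions, so treat this as illustrative only, and note there's no DOM, so it won't render pages on its own:

```rust
// Embed deno_core's JsRuntime and evaluate a small script. No DOM or fetch is
// wired up here, so this is far from a page renderer; it only shows that the
// JS side is reachable from Rust. Exact API details vary by deno_core version.
use deno_core::{JsRuntime, RuntimeOptions};

fn main() {
    let mut runtime = JsRuntime::new(RuntimeOptions::default());

    // Evaluate a trivial expression; a real archiver would need to provide a
    // DOM implementation and network bindings on top of this.
    let result = runtime.execute_script("<eval>", "1 + 2");
    println!("script executed ok: {}", result.is_ok());
}
```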
