Scraping Ajax/SPA pages with Rust

Short Description:

Some webpages are Single Page Applications or fetch their data via Ajax. I want to interact with such pages and scrape data from them, preferably via Rust. (If some other language is better suited for this, I welcome those suggestions too.)

What doesn't work:

  • wget (I need to be able to run the javascript in the HTML file)

What google returns:

  • https://github.com/servo/html5ever (I don't think this works. This seems to focus solely on parsing the HTML. I need something that at the least contains a JS VM).
  • scraper - Rust (this is another popular result, but seems to have same problem as above)

The API I am looking for: something like:

  • createBrowser() -> Browser

  • Browser::visit_webpage(url: &str) -> WebPage

  • WebPage::get_dom() -> some type of dom tree

  • WebPage::send_event(mouse | keyboard | resize | ... ) -> ()

And somewhere there is a server engine (perhaps servo) running and serving these requests.
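
To make the wish list above concrete, here is a rough sketch of how such an API might look as Rust traits. Everything in it (`Browser`, `WebPage`, `Event`, `DomNode`, and the `FakeBrowser` stand-in) is hypothetical and not taken from any existing crate; the fake backend exists only so the sketch compiles and runs.

```rust
/// Hypothetical input events a page could receive.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Event {
    MouseClick { x: u32, y: u32 },
    KeyPress(char),
    Resize { width: u32, height: u32 },
}

/// Hypothetical, heavily simplified DOM node.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct DomNode {
    tag: String,
    text: String,
    children: Vec<DomNode>,
}

/// A loaded page: inspect its DOM and feed it events.
trait WebPage {
    fn get_dom(&self) -> DomNode;
    fn send_event(&mut self, event: Event);
}

/// Something that can load pages (a real backend would drive a browser engine).
trait Browser {
    type Page: WebPage;
    fn visit_webpage(&mut self, url: &str) -> Self::Page;
}

/// Stand-in backend that runs no JavaScript at all; for illustration only.
struct FakeBrowser;

struct FakePage {
    url: String,
    events: Vec<Event>,
}

impl WebPage for FakePage {
    fn get_dom(&self) -> DomNode {
        DomNode {
            tag: "body".to_string(),
            text: format!("{} ({} events received)", self.url, self.events.len()),
            children: Vec::new(),
        }
    }

    fn send_event(&mut self, event: Event) {
        self.events.push(event);
    }
}

impl Browser for FakeBrowser {
    type Page = FakePage;

    fn visit_webpage(&mut self, url: &str) -> FakePage {
        FakePage { url: url.to_string(), events: Vec::new() }
    }
}

fn main() {
    let mut browser = FakeBrowser;
    let mut page = browser.visit_webpage("https://example.com");
    page.send_event(Event::MouseClick { x: 10, y: 20 });
    page.send_event(Event::KeyPress('a'));
    page.send_event(Event::Resize { width: 800, height: 600 });
    println!("{:?}", page.get_dom());
}
```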

EDIT:

Suggestions? I would prefer a Rust solution, but if other languages have better support for this task, I'm open to those too.

EDIT 2:

I am on Linux. Crazy solutions that involve using a fake X window server to open a browser are acceptable.

After a 15-second search, I'd say try out the headless_chrome crate.

Production-ready SPA pre-renderers such as Rendora and prerender.io use the same technique: headless Chrome renders the page and they grab the resulting HTML. So it's a relatively proven strategy, and because a real browser does the rendering, the results should match what a user sees.
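
The "render the page and grab the HTML" technique can be tried without any crate by shelling out to Chrome. This is only a sketch: `--headless`, `--disable-gpu` and `--dump-dom` are Chrome command-line flags, but the binary name `google-chrome` is an assumption about your system, and `dump_dom_command` is a made-up helper. Note that `--dump-dom` prints the DOM once the page has loaded, so data fetched later via Ajax may still be missing.

```rust
use std::process::Command;

/// Builds (but does not run) a headless Chrome invocation that prints the
/// rendered DOM of `url` to stdout.
fn dump_dom_command(chrome_binary: &str, url: &str) -> Command {
    let mut cmd = Command::new(chrome_binary);
    cmd.args(["--headless", "--disable-gpu", "--dump-dom", url]);
    cmd
}

fn main() {
    // "google-chrome" is an assumption; it may be "chromium" or a full path.
    let mut cmd = dump_dom_command("google-chrome", "https://example.com");
    match cmd.output() {
        Ok(out) => println!("{}", String::from_utf8_lossy(&out.stdout)),
        Err(err) => eprintln!("could not start Chrome: {}", err),
    }
}
```

The HTML printed to stdout could then be handed to a parser such as scraper or html5ever.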

@zicklag: Reading the examples / tests, this looks like it should work. I'm going to leave the question "unresolved" for a bit to see if other solutions pop up. Thanks!

No problem! 🙂 Let me know how it goes!

Since you said "preferably" Rust, but other languages are OK, I think Puppeteer is the most commonly used library for this use case. It's for Node.js, though, and supports Chrome/Chromium and Firefox as browser engines.

Unfortunately, it seems that headless_chrome isn't maintained at the moment. Pull requests haven't been merged for some time, and headless_chrome still uses the failure crate for error handling, which makes it difficult to combine with the way errors are handled today.

There were also a few things missing that I needed for my web scraping, and I wasn't able to extend headless_chrome as I needed (most likely only because of my lack of Rust skills at the time... ;)).

headless_chrome uses the DevTools protocol to automate web browsers.

For a few months now, I've been working on a new DevTools protocol client for Rust. I generate the DevTools protocol API automatically from the official protocol definition, so all DevTools protocol functionality will be available from the beginning (headless_chrome implements this API by hand, and only partially).

Unfortunately, it will probably take at least another 1-3 months until I have something I can publish as a crate... 🙂

There is also Fantoccini which uses WebDriver instead of the DevTools protocol.

From my research, it seems WebDriver is less reliable than a DevTools protocol-based approach.

There was also functionality missing in Fantoccini that I needed for my web scraping (if I remember correctly...).

Yeah, I just noticed that as I just started to try it out as an experimental Tauri backend to hack in Windows support until they get that nailed down.

Ooh, that sounds cool. I'd love to hear about that when it comes out. 🙂

Yeah, it seems that WebDriver is a bit less full-featured.

Hey @d4h0 do you have any links to documentation on the DevTools protocol?

Edit: NVM I found it:

I've had a lot of success with fantoccini. It's a Rust interface to the WebDriver protocol... Think of it as a remote control for Chrome or Firefox.

The upside of this approach is you run an actual web browser, so JavaScript frameworks like React and Vue will render and behave just like they would for a normal user. The downside is that you're running a web browser... It's super slow, opens an actual browser window, and isn't overly parallelisable.

Anyone have intuition why the best solutions do not involve Servo? (This is rather surprising to me, as I was expecting some cool servo related library.)

🤷 I don't know. I think Servo might just not be as mature yet?

@d4h0 you don't happen to have anything at all that you could share for your WIP devtools library, even if it isn't completely ready, do you?

Essentially I need all the CDT protocol handling to happen in one thread in an event loop, but the headless_chrome crate uses a separate thread for its event handling and that is incompatible with the !Sync + !Send event handlers I need to support for my use-case. I'm faced with just handling the protocol manually, but if you had something that I could build on it might save some time.

Fine if you don't, but I figured I'd ask just in case.

Ooh, that sounds cool. I'd love to hear about that when it comes out. 🙂

That's great, I will announce it on this forum when it's ready 🙂

Hey @d4h0 do you have any links to documentation on the DevTools protocol?

Edit: NVM I found it:

Yes, that's the best resource page for the DevTools protocol.

Anyone have intuition why the best solutions do not involve Servo? (This is rather surprising to me, as I was expecting some cool servo related library.)

As far as I remember, Servo can only render basic HTML, CSS, and so on.

@d4h0 you don't happen to have anything at all that you could share for your WIP devtools library, even if it isn't completely ready, do you?

Sorry, but no, my crate isn't ready for anything yet. In the next few days, I will complete the low-level API (API types, protocol stuff, internal machinery). On top of this low-level API, I will then build the end-user API ("click", "wait_for_x", and so on).

The code still changes a lot, so that is the reason why I don't want to publish it yet.

Essentially I need all the CDT protocol handling to happen in one thread in an event loop, but the headless_chrome crate uses a separate thread for its event handling and that is incompatible with the !Sync + !Send event handlers I need to support for my use-case.

What I have so far is written in an executor-agnostic, async way (so it can run on a single-threaded executor). But out of curiosity, could you tell me why you have this limitation?

I was planning to make my crate threaded instead of async, but it turned out that async has some benefits for the control flow I needed.

Btw., did you try to set up a headless_chrome event handler that communicates with your !Sync + !Send code via a channel?
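
The channel idea could look roughly like this: the handler that runs on the library's background thread only forwards events through a `std::sync::mpsc` channel, and the `!Send` state stays on the thread that owns it. The sketch uses only the standard library; `BrowserEvent` and the spawned thread are stand-ins for whatever headless_chrome actually delivers, not its real API.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::mpsc;
use std::thread;

/// Hypothetical event type. It must be `Send` so it can cross the channel.
#[derive(Debug)]
enum BrowserEvent {
    PageLoaded(String),
    ConsoleMessage(String),
}

/// Handles every event that arrives on the channel, on the calling thread.
/// `Rc<RefCell<..>>` is `!Send + !Sync`, which is fine here because it never
/// leaves this thread.
fn drain_events(rx: mpsc::Receiver<BrowserEvent>, log: Rc<RefCell<Vec<String>>>) {
    // The loop ends once every sender has been dropped.
    for event in rx {
        match event {
            BrowserEvent::PageLoaded(url) => log.borrow_mut().push(format!("loaded {}", url)),
            BrowserEvent::ConsoleMessage(msg) => log.borrow_mut().push(format!("console: {}", msg)),
        }
    }
}

fn run() -> Vec<String> {
    let (tx, rx) = mpsc::channel();

    // Stand-in for the library's event thread: it only sends plain data.
    let producer = thread::spawn(move || {
        tx.send(BrowserEvent::PageLoaded("https://example.com".to_string())).unwrap();
        tx.send(BrowserEvent::ConsoleMessage("hello".to_string())).unwrap();
        // `tx` is dropped here, which lets the receiving loop finish.
    });

    let log = Rc::new(RefCell::new(Vec::new()));
    drain_events(rx, Rc::clone(&log));
    producer.join().unwrap();

    let result = log.borrow().clone();
    result
}

fn main() {
    for line in run() {
        println!("{}", line);
    }
}
```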

(From your first post)

Preferably via Rust. (If some other language is better suited for this, I'm welcome to those suggestions too.)

As you probably know, Puppeteer.js is most likely the best option at the moment (hopefully that will change for Rust when my crate is ready... ;)).

If you, like me, don't like JavaScript at all, then bs-puppeteer might be an option.

Basically, it's a binding for OCaml/ReasonML and Puppeteer.js (OCaml/ReasonML can be compiled to JavaScript via the BuckleScript compiler).

If I had to build something production-ready right in this second, I most likely would go this route (OCaml and ReasonML are also nice).

But, I don't like JavaScript and its ecosystem at all, so I try to use it as little as humanly possible...

(I'm planning to explore the possibility of compiling and injecting WebAssembly into browser frames, which hopefully would make JavaScript completely unnecessary... :)).

I'm faced with just handling the protocol manually,

To hack together something that is as type-safe as Puppeteer.js should only take a few hours.

Basically, you would do the following:

  1. get the WebSocket address of the server (printed to stdout, or available from http://localhost:9222/json/version via HTTP)

  2. connect to the WebSocket via tungstenite, async_tungstenite or tokio_tungstenite

  3. use serde_json to create commands and parse responses
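
Steps 1 and 3 above can be sketched without any dependencies. This is only an illustration: the JSON is handled by hand to keep the example self-contained, whereas real code would use serde_json as suggested. The helper names (`extract_ws_url`, `build_command`) are made up, and the sample response body is a shortened, invented example of what /json/version returns. The WebSocket connection itself (step 2) is left out.

```rust
/// Pulls the value of "webSocketDebuggerUrl" out of a /json/version body.
/// Naive string search; assumes the value contains no escaped quotes.
fn extract_ws_url(body: &str) -> Option<&str> {
    let key = "\"webSocketDebuggerUrl\"";
    let after_key = &body[body.find(key)? + key.len()..];
    let after_colon = &after_key[after_key.find(':')? + 1..];
    let start = after_colon.find('"')? + 1;
    let len = after_colon[start..].find('"')?;
    Some(&after_colon[start..start + len])
}

/// Builds a DevTools protocol command message.
/// `params_json` must already be valid JSON; nothing is escaped here.
fn build_command(id: u32, method: &str, params_json: &str) -> String {
    format!(r#"{{"id":{},"method":"{}","params":{}}}"#, id, method, params_json)
}

fn main() {
    // Invented, shortened sample of a /json/version response.
    let body = r#"{ "Browser": "Chrome", "webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/abc" }"#;

    if let Some(url) = extract_ws_url(body) {
        println!("connect to: {}", url);
    }

    // This text would be sent as a WebSocket message after connecting.
    let cmd = build_command(1, "Page.navigate", r#"{"url":"https://example.com"}"#);
    println!("{}", cmd);
}
```

Each response carries the same `id` as the command it answers, so a client needs to keep a counter and match replies to requests.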

But I guess it will be a real pain to use the DevTools protocol directly.

If you go this route, the following links should be useful:

'chrome-remote-interface' is a low-level API for the DevTools protocol and JavaScript (basically, what I have almost ready for Rust). The links above show how to do different things via the DevTools protocol.

👍

It's actually because I was trying to experiment with a Chrome backend for Tauri, which normally uses webview to create an in-process webview. Because it is in-process and the webview event loop is single-threaded, Tauri allows !Sync + !Send event handlers for webview JavaScript bindings. To avoid breaking Tauri and that API while supporting Chrome as a seamless backend, I need to support !Sync + !Send event handlers as well, and therefore run the main event loop on one thread.
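
To illustrate the constraint: handlers stored without a `Send`/`Sync` bound may capture things like `Rc` or `Cell`, which is exactly why they have to be called from the one thread that owns them. The `Bindings` type below is a hypothetical sketch, not Tauri's actual API.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical registry of JavaScript-binding handlers. The boxed closures
/// carry no `Send`/`Sync` bound, so they may capture `Rc`, `Cell`, and so on.
struct Bindings {
    handlers: HashMap<String, Box<dyn FnMut(&str) -> String>>,
}

impl Bindings {
    fn new() -> Self {
        Bindings { handlers: HashMap::new() }
    }

    fn register(&mut self, name: &str, handler: impl FnMut(&str) -> String + 'static) {
        self.handlers.insert(name.to_string(), Box::new(handler));
    }

    /// Must be called from the single thread that runs the event loop.
    /// Returns `None` if no handler is registered under `name`.
    fn dispatch(&mut self, name: &str, arg: &str) -> Option<String> {
        self.handlers.get_mut(name).map(|handler| (*handler)(arg))
    }
}

fn main() {
    // `Rc<Cell<u32>>` is `!Send + !Sync`; the handler below captures a clone.
    let clicks = Rc::new(Cell::new(0u32));
    let counter = Rc::clone(&clicks);

    let mut bindings = Bindings::new();
    bindings.register("on_click", move |arg| {
        counter.set(counter.get() + 1);
        format!("click #{} from {}", counter.get(), arg)
    });

    println!("{:?}", bindings.dispatch("on_click", "button"));
    println!("total clicks: {}", clicks.get());
}
```

Moving `bindings` into `std::thread::spawn` would be rejected by the compiler, which is the incompatibility described above.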

I tried, but I couldn't manage anything that would work. I may have been doing something wrong, because that was my first thought and somehow when I wrote it down I couldn't get it to compile. 🤔 I might need to look at that again.
...
Ohhhhhh. I figured it out now. That would work. Ha. I thought about that, but then I got lost somewhere in the attempted implementation before I fully understood what was wrong. It's easy to get lost in the compiler errors sometimes. Duh. Thank you for that. I should be able to get it to work, then.

This is going a bit off topic for a Rust forum, but if the goal is just to solve the problem, would the ideal solution probably be something like Puppeteer + TypeScript (for its type system) rather than Rust + [not as complete libraries]?

As a naïve opinion, I'm fairly certain there are probably about 10 better tools for this than Rust, mostly because of libraries.

It's a valid point and not off-topic. I think it's true that non-Rust solutions are likely going to be better for this, but there are some situations, such as mine, where I had to integrate with a Rust program/library, where doing it from Rust might be your only option. Or maybe your passion for Rust is so great that the worse library is worth it. 😉

But still it's a valid point that Rust is not going to be the most suited for this at this point in time.

https://github.com/tatut/clj-chrome-devtools appears to be auto-generated DevTools bindings for Clojure/JVM.

I wonder if we can use this to auto-generate Rust bindings (hoping the work is just having it generate Rust structs/fns instead of Clojure fns).
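
As a rough idea of what "generate Rust structs" could mean, here is a tiny sketch that turns a description of one protocol command into Rust source text. The `Command` and `Param` types are hypothetical and filled in by hand; a real generator would read them from the official protocol definition, and the type mapping covers only the primitive protocol types.

```rust
/// Hypothetical in-memory description of one command parameter.
struct Param {
    name: &'static str,
    ty: &'static str,
    optional: bool,
}

/// Hypothetical in-memory description of one protocol command.
struct Command {
    domain: &'static str,
    name: &'static str,
    params: Vec<Param>,
}

/// Maps a primitive protocol type to a Rust type name (as source text).
fn rust_type(protocol_ty: &str) -> &'static str {
    match protocol_ty {
        "string" => "String",
        "integer" => "i64",
        "number" => "f64",
        "boolean" => "bool",
        // Fallback for everything this sketch does not model.
        _ => "serde_json::Value",
    }
}

/// Converts camelCase to snake_case (ASCII only).
fn to_snake_case(s: &str) -> String {
    let mut out = String::new();
    for c in s.chars() {
        if c.is_ascii_uppercase() {
            if !out.is_empty() {
                out.push('_');
            }
            out.push(c.to_ascii_lowercase());
        } else {
            out.push(c);
        }
    }
    out
}

/// Upper-cases the first character of `s`.
fn capitalize(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => first.to_ascii_uppercase().to_string() + chars.as_str(),
        None => String::new(),
    }
}

/// Emits the source of a parameter struct for one command.
fn generate_params_struct(cmd: &Command) -> String {
    let mut src = format!("pub struct {}{}Params {{\n", cmd.domain, capitalize(cmd.name));
    for p in &cmd.params {
        let ty = rust_type(p.ty);
        let ty = if p.optional { format!("Option<{}>", ty) } else { ty.to_string() };
        src.push_str(&format!("    pub {}: {},\n", to_snake_case(p.name), ty));
    }
    src.push_str("}\n");
    src
}

fn main() {
    let navigate = Command {
        domain: "Page",
        name: "navigate",
        params: vec![
            Param { name: "url", ty: "string", optional: false },
            Param { name: "referrer", ty: "string", optional: true },
        ],
    };
    print!("{}", generate_params_struct(&navigate));
}
```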
