Getting a web page's "final" html with Rust


#1

Is there a simple way with Rust to get the final html used to generate the version of a web page a user actually sees when they visit the page?

I’m thinking of a Rust-based way to get the html you would see if you right-clicked on a page in your browser and chose the “view page source” option.

These days, the “final html” is not the same as what you would get if you just used wget or curl to retrieve the page’s html, because javascript often manipulates the DOM to a significant extent after the initial html is loaded.

I’m aware that the fantoccini crate is one possible way of achieving what I’d like to do, but I am hoping there might be a simpler approach.

Thanks!


#2

The “view page source” option gives you the same thing as wget/curl. Technically, the DOM is never final; it can keep changing indefinitely. You’d probably want a snapshot of the DOM at the time of the load event.

  • To get the DOM without JavaScript support (not modified by any scripts), you can parse the fetched html with Rust’s html5ever crate.

  • To get the DOM with JavaScript support, after JS modifications, you need a full web browser. Nothing less than an entire browser engine will do, since JS can read and manipulate everything about a browser. So driving a browser through webdriver, or using something like headless Chrome, is the way to do it.

Of course, you can also try running Servo.
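As a rough illustration of the headless Chrome route mentioned above, here is a minimal sketch that shells out to a Chromium-based browser and captures the DOM it prints. It assumes such a browser is installed and that its `--headless` and `--dump-dom` command-line flags are available. The function names and the executable name passed in are hypothetical and depend on your setup. The dump is taken around page load, so scripts that keep modifying the page afterwards may not be reflected.

```rust
use std::io::{Error, ErrorKind};
use std::process::Command;

/// Arguments asking a Chromium-based browser to load `url` without a window
/// and print the serialized DOM to stdout.
fn dump_dom_args(url: &str) -> Vec<String> {
    vec![
        "--headless".to_string(),
        "--dump-dom".to_string(),
        url.to_string(),
    ]
}

/// Run the browser and return the html it printed.
/// `browser` is the executable name or path (assumption: varies by system,
/// for example "chromium" or "google-chrome").
fn rendered_html(browser: &str, url: &str) -> std::io::Result<String> {
    let output = Command::new(browser).args(dump_dom_args(url)).output()?;
    if !output.status.success() {
        return Err(Error::new(
            ErrorKind::Other,
            format!("browser exited with {}", output.status),
        ));
    }
    // Pages are not guaranteed to be valid UTF-8, so convert lossily.
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Calling `rendered_html("chromium", "https://example.com")` would then return the html after scripts have run, or an error if the executable cannot be started.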


#3

Thanks for the reply. And I appreciate the clarification about the html you get with the “view page source” option. I had assumed it was the same as the html you see with “inspect element” in a browser’s development tools. As you noted, however, that’s not the case.

It’s really the html that’s in place after the load event that I’m after, so Servo may be the only way to go.

One possible way of simplifying things might be to use the webdriver interface to Servo. I’ve found at least one example of how to do that. Ideally, though, I’d like to use Servo directly from Rust, rather than through the command line, but it’s not clear to me from the Servo documentation how to do what I want to do with Servo without resorting to the webdriver interface.
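For what it's worth, the webdriver protocol is just HTTP plus JSON, so the round trip can be sketched with the standard library alone. The sketch below assumes a webdriver server (whichever browser is behind it) is already listening on a host and port you pass in, such as the hypothetical `localhost:4444`. It uses the standard W3C WebDriver endpoints: `POST /session`, `POST /session/{id}/url`, `GET /session/{id}/source` and `DELETE /session/{id}`. The function names are made up for illustration, and the session id extraction is deliberately naive; a real program would use an HTTP client and a JSON parser.

```rust
use std::io::{Error, ErrorKind, Read, Write};
use std::net::TcpStream;

/// Build a minimal HTTP/1.1 request for a WebDriver endpoint.
fn webdriver_request(host: &str, method: &str, path: &str, body: &str) -> String {
    format!(
        "{} {} HTTP/1.1\r\nHost: {}\r\nContent-Type: application/json\r\n\
         Content-Length: {}\r\nConnection: close\r\n\r\n{}",
        method, path, host, body.len(), body
    )
}

/// Send one request and return the raw HTTP response (headers and body).
fn send(host: &str, request: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(host)?;
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    // "Connection: close" lets us simply read until the server hangs up.
    stream.read_to_string(&mut response)?;
    Ok(response)
}

/// Naive lookup of the "sessionId" string value in a JSON response.
fn extract_session_id(response: &str) -> Option<&str> {
    let after_key = response.split_once("\"sessionId\"")?.1;
    let after_colon = after_key.split_once(':')?.1;
    let after_quote = after_colon.split_once('"')?.1;
    let (id, _) = after_quote.split_once('"')?;
    Some(id)
}

/// New session, navigate, read the page source, then delete the session.
/// Returns the raw HTTP response of the source request; the html is the
/// JSON-escaped string under "value" in its body.
fn page_source_via_webdriver(host: &str, url: &str) -> std::io::Result<String> {
    let new_session = r#"{"capabilities":{"alwaysMatch":{}}}"#;
    let response = send(host, &webdriver_request(host, "POST", "/session", new_session))?;
    let id = extract_session_id(&response)
        .ok_or_else(|| Error::new(ErrorKind::Other, "no sessionId in response"))?
        .to_string();

    // Assumption: `url` contains no characters that need JSON escaping.
    let navigate = format!(r#"{{"url":"{}"}}"#, url);
    send(host, &webdriver_request(host, "POST", &format!("/session/{}/url", id), &navigate))?;

    let source = send(host, &webdriver_request(host, "GET", &format!("/session/{}/source", id), ""))?;
    send(host, &webdriver_request(host, "DELETE", &format!("/session/{}", id), ""))?;
    Ok(source)
}
```

With the default page load strategy, the navigate call should only return once the page has loaded, which matches the "after the load event" snapshot discussed above.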


#4

It could work. I haven’t tried it myself, but AFAIK Servo supports embedding and headless mode, so it should be doable.


#5

It may not be the exact way you want to do it, but another method of getting the final HTML of a page is to create a small Firefox extension. That’s what I ended up doing on a similar project, and it was pretty easy to set up. Here’s the script called by the extension. It executes when the page has finished loading and uses a websocket to connect to a Rust websocket server listening on localhost:3333.

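// Serialize the DOM as it currently stands (after scripts have run).
const generatedSource = new XMLSerializer().serializeToString(document);
// Note: wss:// means the local server must accept TLS connections.
const socket = new WebSocket("wss://localhost:3333");
// Send the markup as soon as the connection is open.
socket.addEventListener('open', function (event) {
    socket.send(generatedSource + "\n");
});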

Here’s the manifest file for the extension (you might want to double-check the permissions; this was my first one):

{

  "manifest_version": 2,
  "name": "PageSave",
  "version": "1.0",

  "description": "Saves a local copy of the current page's source code",
  
  "applications": {
    "gecko": {
      "id": "<Your email address>"
    }
  },
  
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["page_save.js"],
      "run_at": "document_idle"
    }
  ],
  
  "permissions":[
    "<all_urls>",
    "activeTab",
    "storage",
    "tabs",
    "webRequest",
    "webRequestBlocking"
  ]
}

Doing a persistent install of the extension allows you to use it in headless mode.