HTML parser that lets you calculate text size and position

Hi everyone!
I am looking for a Rust library that can parse HTML and extract text from it, much like this C# renderer does. Basically, it parses HTML and calculates the position and size of its objects (text, rectangles, lines, etc.). Size calculation can be overridden to suit your desired measurement system.
If you know of any Rust library that can do that, please do tell. A library that can extract text from HTML along with info about its font will suffice.
Thanks!

Parsing HTML and rendering are two very different tasks. For parsing you can use html5ever. For rendering, well, you need a rendering engine that can also handle things like CSS and fonts; I assume Servo, or parts of it, fits that purpose.

I don't necessarily need the library to do rendering for me. If I can get something like font-family and font-size from a paragraph, along with the actual text, that should be enough.
I will look into your suggestions, thank you.
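
For the narrow case described here, a minimal sketch of pulling font-family and font-size out of an element's inline style attribute might look like the following. It uses only the standard library and assumes the element text and the raw style string have already been obtained from an HTML parser such as html5ever. The names parse_inline_style, StyledText, and styled_text are hypothetical. It does not cover stylesheets, inheritance, or the cascade, which is where most real font information comes from.

```rust
use std::collections::HashMap;

/// Parse an inline CSS declaration list such as
/// "font-family: Arial; font-size: 12px" into a property -> value map.
/// Hypothetical helper: it ignores comments, escapes, `!important`, and
/// semicolons inside quoted strings or url(), all of which a real CSS
/// parser has to handle.
pub fn parse_inline_style(style: &str) -> HashMap<String, String> {
    style
        .split(';')
        .filter_map(|decl| {
            // Each declaration is "name: value"; skip anything without a colon.
            let (name, value) = decl.split_once(':')?;
            let name = name.trim().to_ascii_lowercase();
            let value = value.trim();
            if name.is_empty() || value.is_empty() {
                return None;
            }
            Some((name, value.to_string()))
        })
        .collect()
}

/// Text plus the two font properties asked about in the thread.
#[derive(Debug, PartialEq)]
pub struct StyledText {
    pub text: String,
    pub font_family: Option<String>,
    pub font_size: Option<String>,
}

/// Combine already-extracted element text with its inline style attribute.
pub fn styled_text(text: &str, style_attr: &str) -> StyledText {
    let props = parse_inline_style(style_attr);
    StyledText {
        text: text.to_string(),
        font_family: props.get("font-family").cloned(),
        font_size: props.get("font-size").cloned(),
    }
}
```

Calling styled_text("Hello", "font-family: Arial; font-size: 12px") yields a StyledText whose font_family is Some("Arial") and whose font_size is Some("12px"). Values are kept as raw strings; resolving units such as em or % needs the parent's computed style.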

Determining where everything goes on the page is the bulk of what an HTML renderer does; actually drawing the pixels is simple in comparison.

Other CSS things that can affect the layout and positioning of elements and text are font-weight, line-height, where the occluding float boxes are, which hyphenation mode is set, margins, padding, borders, whether the CSS has injected any content via the :before or :after pseudo-elements, which display mode is active for all the parent elements (including odd ones like flex or table), and probably a whole lot more.

To expand on the issue somewhat:

The display: flex algorithm alone, easily the simplest after absolute, is dozens of spec pages long. Even the core loop has over 10 steps, and that doesn't include anything about getting the actual sizes of the items being laid out.
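
To give a feel for how small a single piece of that algorithm is next to the whole, here is a hedged sketch of only the flex-grow distribution step. FlexItem and distribute_free_space are hypothetical names. The sketch assumes each item's base size is already known and that grow factors are non-negative, and it leaves out flex-shrink, min/max clamping (the part that makes the real step a loop), wrapping, and the cross axis.

```rust
/// One flex item, reduced to the two inputs this sketch uses.
#[derive(Debug, Clone, Copy)]
pub struct FlexItem {
    /// Size along the main axis before growing; assumed already computed.
    pub base_size: f32,
    /// The flex-grow factor; assumed non-negative.
    pub grow: f32,
}

/// Distribute positive free space among items in proportion to flex-grow.
/// Returns the resulting main-axis size of each item, in order.
pub fn distribute_free_space(container: f32, items: &[FlexItem]) -> Vec<f32> {
    let used: f32 = items.iter().map(|i| i.base_size).sum();
    // This sketch only grows; negative free space would need flex-shrink.
    let free = (container - used).max(0.0);
    let total_grow: f32 = items.iter().map(|i| i.grow).sum();
    // When the grow factors sum to less than 1, only that fraction of the
    // free space is handed out. Dividing by at least 1 gives that behaviour
    // and also avoids dividing by zero when nothing can grow.
    let divisor = total_grow.max(1.0);
    items
        .iter()
        .map(|i| i.base_size + free * i.grow / divisor)
        .collect()
}
```

With a 300-unit container and items of base size 100 (grow 1) and 50 (grow 3), the 150 units of free space split 1:3, giving sizes 137.5 and 162.5.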

The traditional block/document layout is far more complex, and nearly all layout logic interacts with item sizing, which is itself quite complex, in no small part because it generally requires text layout, its own field of absurd complexity.
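
As a rough illustration of why item sizing depends on text layout, the sketch below does greedy line breaking at whitespace with a pluggable measurement function, similar in spirit to the overridable size calculation mentioned in the original question. TextMeasure, FixedAdvance, and break_lines are hypothetical names, and FixedAdvance is a stand-in for real font metrics. Real text layout also deals with shaping, kerning, bidirectional text, hyphenation, and Unicode line-breaking rules, none of which appear here.

```rust
/// Pluggable text measurement, so the caller decides the unit system.
pub trait TextMeasure {
    /// Width of `text` when set on a single line.
    fn width(&self, text: &str) -> f32;
}

/// Hypothetical stand-in for font metrics: every character has the same advance.
pub struct FixedAdvance(pub f32);

impl TextMeasure for FixedAdvance {
    fn width(&self, text: &str) -> f32 {
        text.chars().count() as f32 * self.0
    }
}

/// Greedy first-fit line breaking at whitespace.
/// A single word wider than `max_width` is left on its own line and overflows.
pub fn break_lines(text: &str, max_width: f32, measure: &dyn TextMeasure) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        if current.is_empty() {
            // The first word on a line is always accepted.
            current.push_str(word);
            continue;
        }
        let candidate = format!("{} {}", current, word);
        if measure.width(&candidate) <= max_width {
            current = candidate;
        } else {
            // Close the current line and start a new one with this word.
            lines.push(std::mem::replace(&mut current, word.to_string()));
        }
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}
```

The number of lines returned, multiplied by a line height, is what a block would report as its height, so a change in available width changes the height of every text block and everything positioned below it.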

In comparison, rendering after layout, while not trivial due to considerations around caching, hardware support, or rendering only what's visible, is fairly straightforward code logically. I'd guess it would be less than a fifth the size of layout code.

Worst of all for you: the spec leaves a lot up to individual browsers. There are hundreds of little details, like default font sizes, scroll bars, form controls, etc., that can be dramatically different.

Likely the easiest way to get something actually accurate here is to run the browser you want to match in headless mode and call getBoundingClientRect() on everything after the page finishes loading.

Depending on what you're doing, a good option might be to abandon arbitrary HTML and fully lay out and render everything yourself, which with restrictions can be not too much work (avoid layout interaction: size things either inside out or outside in, with clipping/scrolling to connect them).
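
One reading of that restriction, sketched under assumptions: widths flow strictly outside in (a parent tells each child how wide it is) and heights flow strictly inside out (each child reports how tall it ended up), so a single recursive pass is enough. Rect, Node, and layout are hypothetical names, and the leaf heights are assumed to be known up front, for example from a line-breaking step.

```rust
/// Final position and size of one box, in the caller's units.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

/// A deliberately tiny layout tree.
pub enum Node {
    /// Leaf with a known height; it takes the full width it is offered.
    Leaf { height: f32 },
    /// Children stacked top to bottom inside uniform padding.
    Stack { padding: f32, children: Vec<Node> },
}

/// Lay out `node` with its top-left corner at (x, y) and the given width.
/// Rects are appended to `out` in pre-order; the node's height is returned.
pub fn layout(node: &Node, x: f32, y: f32, width: f32, out: &mut Vec<Rect>) -> f32 {
    match node {
        Node::Leaf { height } => {
            out.push(Rect { x, y, width, height: *height });
            *height
        }
        Node::Stack { padding, children } => {
            // Reserve this node's slot now; its height is patched in below.
            let index = out.len();
            out.push(Rect { x, y, width, height: 0.0 });
            // Outside in: the child width is fixed before any child is visited.
            let inner_width = (width - 2.0 * padding).max(0.0);
            let mut cursor = y + padding;
            for child in children {
                // Inside out: each child reports its height back up.
                cursor += layout(child, x + padding, cursor, inner_width, out);
            }
            let height = cursor + padding - y;
            out[index].height = height;
            height
        }
    }
}
```

A Stack with padding 10 holding leaves of height 20 and 30, laid out at (0, 0) with width 200, comes back 70 units tall, with the leaves at y = 10 and y = 30 and each 180 wide. Because no child's size ever feeds back into its parent's width, there is nothing to iterate on, which is most of what keeps this small.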

"Extensive HTML 4.01 and CSS level 2 specifications support."

That's from that C# renderer. Wow, that's old! The scope of HTML text layout may well have been simpler back then.

It seems like you want HTML and CSS parsing to do whatever it is you're up to (it's not entirely clear what you're up to).

Servo and its parts might be suitable for this, though I expect there are simpler alternatives out there on lib.rs.