Recommendations for HTML parsing

#1
  1. Yes, I’ve checked crates.io. There are too many results, so I am hoping for recommendations from personal experience.

  2. I have a number of HTML files. They are all < 50MB, so a non-streaming parser is fine.

  3. I want a library that turns an HTML file (as a string) into a DOM tree, which I can then walk to extract the text.

  4. I do not want an HTML file -> text solution. I want the intermediate DOM tree in case I want to check attributes of DOM elements.

  5. Recommendations?

Thanks!
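
To make the requirement in points 3 and 4 concrete, here is a rough sketch of the tree walk in question. It uses only the standard library: the `Node` type, `collect_text`, `attr_values`, and `example_tree` are hypothetical names made up for illustration and are not the API of any crate. A real parser would build the tree from the HTML string; here the tree is built by hand.

```rust
use std::collections::HashMap;

// Hypothetical DOM node type, for illustration only.
// Real parsing crates expose their own node types.
#[derive(Debug)]
enum Node {
    Element {
        tag: String,
        attrs: HashMap<String, String>,
        children: Vec<Node>,
    },
    Text(String),
}

/// Appends the text of `node` and all of its descendants to `out`,
/// in document order.
fn collect_text(node: &Node, out: &mut String) {
    match node {
        Node::Text(t) => out.push_str(t),
        Node::Element { children, .. } => {
            for child in children {
                collect_text(child, out);
            }
        }
    }
}

/// Collects the value of attribute `name` from every `tag` element,
/// in document order. This is the part a plain HTML -> text
/// converter cannot offer.
fn attr_values<'a>(node: &'a Node, tag: &str, name: &str, out: &mut Vec<&'a str>) {
    if let Node::Element { tag: t, attrs, children } = node {
        if t.as_str() == tag {
            if let Some(v) = attrs.get(name) {
                out.push(v.as_str());
            }
        }
        for child in children {
            attr_values(child, tag, name, out);
        }
    }
}

/// Hand-built tree for:
/// <p>Hello <a href="https://example.com">world</a>!</p>
fn example_tree() -> Node {
    let mut attrs = HashMap::new();
    attrs.insert("href".to_string(), "https://example.com".to_string());
    Node::Element {
        tag: "p".to_string(),
        attrs: HashMap::new(),
        children: vec![
            Node::Text("Hello ".to_string()),
            Node::Element {
                tag: "a".to_string(),
                attrs,
                children: vec![Node::Text("world".to_string())],
            },
            Node::Text("!".to_string()),
        ],
    }
}
```

Calling `collect_text` on `example_tree()` gathers the text of the whole fragment, and `attr_values(&tree, "a", "href", &mut out)` gathers the link targets. Any library that fits the request should allow both kinds of traversal over its own node type.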


#2

I think html5ever is pretty much the standard for this. It might be a bit too low-level for what you’re after, though. I think there are some higher-level libraries built on top of it, but I’ve never tried any of them.


#3

So far, I’ve had the best experience with kuchiki. It’s still not as easy as something like BeautifulSoup in Python.
