Recommendations for HTML parsing

  1. Yes, I’ve checked . There are too many results and I am hoping for recommendations from personal experience.

  2. I have a number of HTML files. They are all < 50MB, so a non-streaming parser is fine.

  3. I want a library that does HTML file (as string) -> DOM Tree, from which I can then walk the tree and extract the text.

  4. I do not want an HTML file -> text solution. I want the intermediate DOM tree in case I want to check attributes of DOM elements.

  5. Recommendations?




I think html5ever is pretty much the standard for this. It might be a bit too low-level for what you’re after though. I think there are some higher-level libraries built on top of it, but I’ve never tried any of them.



So far, I’ve had the best experience with kuchiki. It’s still not as easy as something like BeautifulSoup in Python.

