Recommendations for HTML parsing

  1. Yes, I've checked . There are too many results and I am hoping for recommendations from personal experience.

  2. I have a number of HTML files. They are all < 50MB, so a non-streaming parser is fine.

  3. I want a library that does HTML file (as string) -> DOM Tree, from which I can then walk the tree and extract the text.

  4. I do not want a HTML file -> text solution. I want the intermediate DOM tree in case I want to check attributes of DOM elements.

  5. Recommendations?



I think html5ever is pretty much the standard for this. It might be a bit too low level for what you're after though. I think there are some higher level libraries built on top of it, but I've never tried any of them.


So far, I’ve had the best experience with kuchiki. It’s still not as easy as something like BeautifulSoup in Python.


I've been very happy with the scraper library. The API is way easier than html5ever directly, and it has good examples in the README too.


I have a similar requirement to the original question. After one year, are there any updated recommendations? I'm still torn between kuchiki and scraper, and I'm not sure which would be best. Both are very scarcely documented (kuchiki especially is almost completely lacking examples), not very active (not sure whether that is because traction was lost or because they are complete enough not to need much more work), and have roughly equal numbers of downloads and of stars/watchers on GitHub. One of the authors of Kuchiki is (was) also a main contributor to html5ever, which is why I'm leaning more towards Kuchiki, but that's all I have to go on for now.

Has anyone tried both and can make a quick comparison?


