Introducing urls2disk, a crate that concurrently downloads a given list of webpages and saves them to disk, optionally converting them to PDFs



I’d like to introduce urls2disk, a crate that concurrently downloads a given list of webpages and saves them to disk, optionally converting them to PDFs.

Any feedback would be greatly appreciated!

urls2disk is a Rust crate that downloads a series of webpages in parallel and saves them to disk. Depending on user-defined settings, it either writes the raw bytes of each downloaded page to disk or converts the page to PDF first and writes that instead. It’s helpful for general web scraping as well as for converting a batch of webpages to PDF.
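To illustrate the general "download in parallel, then write to disk" pattern, here is a minimal sketch using only the standard library. It is not urls2disk's actual API: `save_all` and the `fetch` closure are hypothetical names, and `fetch` stands in for whatever HTTP client (or PDF conversion step) would produce the bytes.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::thread;

/// Hypothetical helper: runs one thread per (url, path) job, calls the
/// caller-supplied `fetch` to get the bytes, and writes them to `path`.
/// Returns one result per job, in the same order as `jobs`.
fn save_all<F>(jobs: &[(String, PathBuf)], fetch: F) -> Vec<io::Result<()>>
where
    F: Fn(&str) -> io::Result<Vec<u8>> + Sync,
{
    let fetch = &fetch;
    // Scoped threads may borrow `jobs` and `fetch` without cloning them.
    thread::scope(|s| {
        let handles: Vec<_> = jobs
            .iter()
            .map(|(url, path)| {
                s.spawn(move || -> io::Result<()> {
                    let bytes = fetch(url.as_str())?;
                    fs::write(path, bytes)
                })
            })
            .collect();
        // Joining in spawn order keeps results aligned with `jobs`.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

One thread per URL is fine for a short list; for a long list a real implementation would more likely use a bounded worker pool or an async runtime.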

A key feature of urls2disk is that you can set a maximum number of requests per second while downloading webpages. This lets you throttle yourself so you don’t run afoul of servers that will block you if you hit them with too many requests at once.
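As a rough idea of how a requests-per-second cap can work, here is a minimal sketch that spaces request start times at least `1 / max_per_second` apart. The `Throttle` type and its methods are hypothetical and written for illustration only; this is not how urls2disk necessarily implements its limit.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical throttle: each call to `wait` blocks until the next
/// request slot, so slots are at least `min_interval` apart.
struct Throttle {
    min_interval: Duration,
    next_slot: Instant,
}

impl Throttle {
    fn new(max_per_second: u32) -> Self {
        assert!(max_per_second > 0, "rate must be positive");
        Throttle {
            min_interval: Duration::from_secs(1) / max_per_second,
            next_slot: Instant::now(),
        }
    }

    /// Sleeps if the next slot is still in the future, then reserves
    /// the following slot.
    fn wait(&mut self) {
        let now = Instant::now();
        if self.next_slot > now {
            thread::sleep(self.next_slot - now);
        }
        self.next_slot = std::cmp::max(self.next_slot, now) + self.min_interval;
    }
}
```

A caller would create one `Throttle` and call `wait()` immediately before sending each request. To share it across threads, it would need to be wrapped in something like a `Mutex`.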

Personally, I’m using urls2disk to download a bunch of SEC filings from the SEC’s website (company annual reports, quarterly reports, etc.). I’m not a professional programmer but an investor, so my day-to-day involves a lot of reading of financial reports. I like having them in PDF format in Dropbox so I can read them on my iPad and take notes. urls2disk is one piece of a larger application that automates the gathering of these documents for me, and I thought I would share it with the community.

In any case, hopefully this ends up being useful for others as well. Any comments or feedback would be most welcome. Again, I’m a hobbyist programmer, so I don’t have many people to get feedback from!