Hi,
I'm building out a community-driven search engine in Rust.
Currently I'm working on the web crawler. I refer to it as the "harvester" because I see it as the means of grabbing material for me to index. The harvester currently uses one struct called HtmlDocument.
To crawl a site for links, I have attached the DOM manipulation functionality to the HtmlDocument struct.
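To give an idea of the shape I mean, here's a rough sketch; the names and the naive link scan are my own illustration, not the actual harvester code (a real crawler would use a proper HTML parser such as the `scraper` crate):

```rust
// Hypothetical sketch of an HtmlDocument with link extraction attached.
struct HtmlDocument {
    url: String,
    body: String,
}

impl HtmlDocument {
    // Extract href values with a naive string scan; illustration only.
    fn links(&self) -> Vec<String> {
        let mut out = Vec::new();
        let mut rest = self.body.as_str();
        while let Some(i) = rest.find("href=\"") {
            rest = &rest[i + 6..];
            if let Some(j) = rest.find('"') {
                out.push(rest[..j].to_string());
                rest = &rest[j + 1..];
            } else {
                break;
            }
        }
        out
    }
}

fn main() {
    let doc = HtmlDocument {
        url: "https://example.com".to_string(),
        body: r#"<a href="/a">A</a> <a href="/b">B</a>"#.to_string(),
    };
    assert_eq!(doc.links(), vec!["/a", "/b"]);
    let _ = &doc.url;
}
```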
The cloned repository is 2.4 GB even though it wouldn’t need to be more than 1 MB. You’re missing .gitignore file(s) for the target folders, so your repository contains and tracks compiled binaries!
To fix this, you can rewrite the history of your repository to decrease its size significantly: make sure you’re in the main directory of your repository and then do something like
Thank you so much. I ran the commands and they seemed to work. I nuked my local copy and re-cloned it, and it looks much better. Wow! I did not know that. I'm sorry, that was not intentional. I appreciate you telling me. I'm surprised the project is so large already. I have things split up into a lot of projects. What is a good project structure to mitigate size like this? Or is this normal?
The size was only due to the accidentally tracked target folders. I re-cloned the repo, and it's now only 520 kilobytes total when cloned and checked out on my computer. No need to worry or split anything up.
Next thing you could do is set up a cargo workspace (by creating a Cargo.toml containing [workspace] and members = … etc., as described behind this link) so that things like cargo check/cargo test/etc. can be used at the top level to check/compile/test all the contained crates.
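A top-level Cargo.toml for that might look like the following; the member names are assumptions based on your description (only `harvester` is mentioned in the thread), so adjust them to your actual crate directories:

```toml
# Top-level Cargo.toml, placed next to the member crate directories.
[workspace]
resolver = "2"
members = [
    "harvester",   # the crawler
    "indexer",     # hypothetical further crates
    "search-api",
]
```

A side benefit: workspace members share a single top-level `target/` directory instead of each crate building its own, which also keeps local disk usage down.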
Next thing you could do is run cargo clippy. It points out a few small issues in your code. Read the suggestions/warnings to perhaps learn a thing; most of the issues can be fixed automatically with cargo clippy --fix. The warning saying “this if has identical blocks” at harvester/src/workers/harvest_worker.rs:35:37 can be addressed by combining the conditions with a logical or, e.g.
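Since the actual code at that location isn't quoted here, a hypothetical illustration of the pattern clippy's `if_same_then_else` lint complains about, and the fix:

```rust
// Before (two branches with identical bodies trigger the lint):
//
//     if url.ends_with(".png") {
//         true
//     } else if url.ends_with(".jpg") {
//         true
//     } else {
//         false
//     }
//
// After: combine the conditions with a logical OR.
fn should_skip(url: &str) -> bool {
    url.ends_with(".png") || url.ends_with(".jpg")
}

fn main() {
    assert!(should_skip("logo.png"));
    assert!(should_skip("photo.jpg"));
    assert!(!should_skip("index.html"));
}
```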
My next answer comes as an addition to this one, as Discourse only allows 3 answers in sequence by the same person...
I see you’re using the blocking std::thread::sleep in an async fn (executed via tokio). Blocking functions should usually be avoided in async code. Instead, use e.g. the non-blocking async sleep function from tokio, sleep in tokio::time - Rust. On the other hand, you don’t really seem to be using the capabilities of async code at all anyway: no spawning or joining of any tasks/futures. You might as well write completely synchronous code (i.e. no async fn and no need for tokio) using the API in reqwest::blocking - Rust.
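A minimal sketch of what the synchronous version could look like, using only the standard library; the `fetch` function is a stand-in I made up for illustration (in the real crawler it would presumably be a call to reqwest's blocking API):

```rust
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for an HTTP request; in the real harvester this would be
// something like reqwest::blocking::get(url).
fn fetch(url: &str) -> String {
    format!("<html>{url}</html>")
}

fn main() {
    let urls = ["https://example.com/a", "https://example.com/b"];
    // Crude rate limiting: at most one request per `delay`.
    let delay = Duration::from_millis(50);
    let start = Instant::now();
    for url in &urls {
        let _body = fetch(url);
        thread::sleep(delay); // fine here: no async runtime to block
    }
    assert!(start.elapsed() >= delay * urls.len() as u32);
}
```

In a fully synchronous program, thread::sleep is unproblematic, since there's no executor being starved while you wait.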
This is all great. Thanks for the clippy advice; I wasn't sure what to make of it when I ran it, and now with your example it makes sense. I'm looking at how to rework everything for workspaces; coming from JS, I wasn't sure how to best separate parts.
The async part I was actually thinking about this morning, because I don't really need it and the crawler is blocked from doing work by the only thing that is async.
The only reason I use thread::sleep is to try and set up some rate limiting when doing a blind crawl. I wasn't sure how to monitor active connections, which is what I probably want to do.
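For what it's worth, one simple shape for that (all names here are my own, not from the harvester codebase) is a per-host limiter that remembers when each host was last hit and sleeps only for the remaining gap:

```rust
use std::collections::HashMap;
use std::thread;
use std::time::{Duration, Instant};

// Minimal per-host rate limiter sketch. Blocks until `min_delay` has
// elapsed since the last request to the same host.
struct RateLimiter {
    min_delay: Duration,
    last_hit: HashMap<String, Instant>,
}

impl RateLimiter {
    fn new(min_delay: Duration) -> Self {
        Self { min_delay, last_hit: HashMap::new() }
    }

    fn wait_for(&mut self, host: &str) {
        if let Some(last) = self.last_hit.get(host) {
            let elapsed = last.elapsed();
            if elapsed < self.min_delay {
                thread::sleep(self.min_delay - elapsed);
            }
        }
        self.last_hit.insert(host.to_string(), Instant::now());
    }
}

fn main() {
    let mut limiter = RateLimiter::new(Duration::from_millis(40));
    let start = Instant::now();
    limiter.wait_for("example.com"); // first hit: no wait
    limiter.wait_for("example.com"); // second hit: waits out the gap
    assert!(start.elapsed() >= Duration::from_millis(40));
}
```

This only throttles by time between requests rather than counting active connections, but in a single-threaded synchronous crawler there's at most one connection in flight anyway.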
I really appreciate the feedback and you looking through it.