Can I get a review?

Hi,
I'm building out a community driven search engine in Rust.

Currently I'm working on the web crawler. I refer to it as the "harvester" because I see it as the means to grab material for me to index. The harvester currently uses one struct called HtmlDocument.

To crawl a site for links, I have attached the dom manipulation functionality attached to the HtmlDocument struct.

The main dirs:
https://github.com/salugi/ferryman/tree/main/harvester/src/workers
https://github.com/salugi/ferryman/tree/main/models/src/documents

I wrote document comments on the methods/function/structs detailing what each do and why. I've attempted to make it as readable as possible.

If I could get cursory feedback or comments on things to change, it would be much appreciated.

The cloned repository is 2.4 GB even though it wouldn’t need to be more than 1 MB. You’re missing .gitignore file(s) for the target folders, so your repository contains and tracks compiled binaries!

3 Likes

To fix this, you can rewrite the history of your repository to decrease its size significantly: make sure you’re in the main directory of your repository and then do something like

git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch harvester_cli/target harvester/target models/target scratches/record_scratches/target scratches/target' -f

git push --force

(edited the instructions, as I see you’ve already created the gitignore)

Thank you so much. I ran the code, it looked to work. I nuked my local copy and redownloaded it and it seemed better. Wow! I did not know that. I'm sorry, that was not intentional. I appreciate you telling me. I'm surprised the project is so large already. I have things split up into a lot of projects. What is good project structure to mitigate size like this? Or is this normal?

The size was only due to the accidentally tracked target folders. I re-cloned the repo now, it’s now only 520 kilobytes total when cloned and checked out on my computer. No need to worry or split anything up.

Next thing you could do is set up a cargo workspace (by creating a Cargo.toml containing [workspace] and members = … etc, as described behind this link) so that things like cargo check/cargo test/etc can be used on to-level to check/compile/test all the contained crates.

Next thing you could do is run cargo clippy. It points out a few small issues in your code. Read the suggestions / warning to perhaps learn a thing; most of the issues can be fixed automatically with cargo clippy --fix. The warning/error saying “this if has identical blocks” in harvester/src/workers/harvest_worker.rs:35:37 can be addressed by combining the conditions with a logical or, e.g.

if links.len() > link_limit || indexer >= page_limit || indexer >= links.len() {
    break;
}

You could also consider using automatic formatting via cargo fmt.


@salugi

My next answer as an addition to this one as discourse only allows 3 answers by the same person in sequence...

I see you’re using the blocking std::sleep in an async fn (executed via tokio). Usually blocking functions should be avoided in async code. Instead use e.g. async non-blocking sleep function from tokio, sleep in tokio::time - Rust. On the other hand, you don’t really seem to be using the capabilities of async code at all anyways. No spawning or joining of any tasks/futures. You might as well write completely synchronous code (i.e. no async fn and no need to use tokio) using the API in reqwest::blocking - Rust.

This is all great. Thanks for the clippy advice because I wasn't sure of what to make of that when I ran it and now with your example it makes sense. Im looking at how to rework everything for workspaces, coming from js I wasn't sure how to best separate parts.

The async part I was actually thinking about this morning because I don't really need it and the crawler is blocked from doing work with the only thing that is async.
The only reason I use thread.sleep is to try and set up some rate limiting if doing a blind crawl. I wasn't sure on how to monitor active connection which is what I probably want to do.

I really appreciate the feedback and you looking through it.

You’d just need to add a file
Cargo.toml

[workspace]
members = [
    "harvester_cli",
    "harvester",
    "models",
    "scratches/record_scratches",
    "scratches"
]

in the main directory. No changes required besides that.

Thats really easy, thanks for spelling that out for me. I wasn't sure how to incorporate it looking at examples. This is such a cool language.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.