Architecture for writing a crawler in Rust

I am writing something that can be qualified as a "crawler". I chose Rust for the task because of its efficiency and the ease of deployment, debugging, and maintenance. So far it has been great in this regard, but I got the architecture quite wrong. It has deeply nested logic that can often fail, because it does things over the network. Parts of the logic frequently don't get executed because some previous step failed, even though that previous step is not necessarily a dependency.

I think I need a different architecture, maybe something message-oriented (or maybe not, ideas welcome)... But I don't want to use something like RabbitMQ or any "microservices"-style architecture, as that would defeat the purpose of going lean, which is my main goal. I just want to make my software more robust without increasing the complexity a ton, so I want to do it in Rust code.

My question is mostly in two directions:

  1. What should I read? Do you know of something on the topic, preferably Rust-specific? (I don't want a tutorial on using RabbitMQ, as I don't intend to use it.) I should probably read something abstract on the topic (of message-based architectures), but I also don't want to go too deep there, as I just want my software working, not to study for several months :slight_smile:
  2. What crates are available for this kind of architecture? Is there some framework that can simplify my task a lot? Please note that this is not a matter of speed or scalability, just a matter of robustness, so I don't want some crazy distributed framework.

Of course, any "general advice" is also very welcome.

You'd need to provide more details to get any useful advice. From what you've told us, it is unclear what you're trying to achieve: "crawler" is fairly vague. Also, RabbitMQ is a message-passing system, but it is unclear why you're thinking about it in this context.

What architecture?

Well, I didn't mention terms like "data pipeline", "ETL" etc., because 1) I am not sure that this is what I want/need to do and 2) I am open to general ideas on how to rearchitect my code.

If my explanation of what the software does is not clear, I'll try to explain it better now in pseudo (Rust) code:

fn crawler() {
    loop {
        let resA = fetch_A_from_the_network();
        match resA {
            caseA => fetch_B_from_the_network(),
            caseB => {
                do_calculations();
                let resB = fetch_B_from_the_network();
                // do other things
                match resB {
                    caseC => fetch_A_again(),
                    caseD => fetch_D(),
                }
            }
        }
    }
}

This is a very rough sketch, but the idea is that I have relatively complex, deeply nested logic, and at the deepest levels I make network calls that can fail. This causes me two kinds of problems:

  1. Something fails deep in the logic, in a way that wasn't foreseen, rendering the software inoperable.
  2. In order to get to D I have to go through A, although A (which may fail) is not a dependency of D; B is (and A is a dependency of B, but for D we can use a cached version of B). Of course I can use a database to store intermediate results, but I also need some rate limiting (roughly: try D on every A retrieval, but if A fails sometimes, it's not a problem).

So I want to flatten the logic somehow and make it less interdependent. I want both general ideas on how to do this (but Rust-specific) and concrete crates/frameworks that can be used. I would also be happy if someone could share specific case studies of writing such software.

You cannot make something that is inherently dependent, independent. However, if you separate the handling of separate components into separate functions, instead of one giant match block with everything inlined, then things will look much better.
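For example (just a sketch with made-up names mirroring your pseudocode; the enum and handlers are placeholders, not real code from anywhere):

// A sketch of the "one function per case" idea; names are placeholders.
enum FetchAResult {
    CaseA,
    CaseB,
}

fn crawler_step(res_a: FetchAResult) {
    match res_a {
        FetchAResult::CaseA => handle_case_a(),
        FetchAResult::CaseB => handle_case_b(),
    }
}

fn handle_case_a() {
    // fetch B from the network, etc.
}

fn handle_case_b() {
    // do_calculations(), fetch B, then branch on its result
}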

Just a match statement isn't enough to represent things like this. You need to represent the dependency graph somehow inside your code. Then, using that dependency graph (hopefully a tree), you can encode the logic that decides when to "trigger" a given step.
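A minimal sketch of what that could look like (the Step names are placeholders mirroring your A/B/D steps):

use std::collections::{HashMap, HashSet};

// Placeholder step identifiers mirroring the A/B/D steps from the question.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Step {
    FetchA,
    FetchB,
    FetchD,
}

// A step may run once all of its dependencies have completed.
fn runnable(step: Step, deps: &HashMap<Step, Vec<Step>>, done: &HashSet<Step>) -> bool {
    deps.get(&step)
        .map(|ds| ds.iter().all(|d| done.contains(d)))
        .unwrap_or(true)
}

fn main() {
    // D depends on B, B depends on A; A has no dependencies.
    let deps = HashMap::from([
        (Step::FetchB, vec![Step::FetchA]),
        (Step::FetchD, vec![Step::FetchB]),
    ]);

    let mut done = HashSet::new();
    done.insert(Step::FetchA);
    done.insert(Step::FetchB);

    // D can still run even if a later attempt at A fails,
    // because B has already completed.
    assert!(runnable(Step::FetchD, &deps, &done));
}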

1 Like

I would probably remove the nesting/recursion and phrase the application as a queue of jobs.

struct Job {
  attempt_number: usize,
  url: String,
  page_type: PageType,
}

/// The different kinds of pages that may be processed.
enum PageType {
  A,
  B,
  C,
}

From there, I would write a top-level loop that reads jobs from some sort of MPSC channel.

To handle spurious errors, you can create a custom error type with a retryable() method and track how many times the same job has been attempted. Your hypothetical process_job() function would accept some sort of JobContext object that it can use to do things like scheduling new jobs (e.g. you just found a new link to crawl) or storing scraped data in a database. It also gives you a natural place to store configuration.

#[tokio::main]
async fn main() {
    let max_concurrency = 42;
    let max_attempts = 5;

    let (sender, mut receiver) = tokio::sync::mpsc::channel(max_concurrency);
    let ctx = JobContext { sender };

    // (Seed the queue with the initial job(s) via ctx.sender here.)

    while let Some(job) = receiver.recv().await {
        let ctx = ctx.clone();
        tokio::spawn(async move {
            let result = process_job(&job, ctx.clone()).await;
            match result {
                Ok(_) => {
                    // Page crawled successfully. Yay!
                }
                Err(e) if e.retryable() && job.attempt_number < max_attempts => {
                    // Reschedule the job with an incremented attempt counter.
                    let _ = ctx
                        .sender
                        .send(Job {
                            attempt_number: job.attempt_number + 1,
                            ..job
                        })
                        .await;
                }
                Err(e) if e.retryable() => {
                    // Tried too many times. Log it and continue.
                }
                Err(e) => {
                    // It doesn't make sense to retry this. Log it and ignore the job.
                }
            }
        });
    }
}

#[derive(Clone)]
struct JobContext {
    sender: tokio::sync::mpsc::Sender<Job>,
}

enum JobError {}

impl JobError {
    fn retryable(&self) -> bool {
        todo!()
    }
}

The way I would make jobs less interdependent is by matching on the kind of page you are scraping and invoking a routine specific to it. Sometimes extra data needs to be passed from one page to another (e.g. you scrape page A and then use its title when scraping page B), in which case you can store it in the enum that tells you what type of job it is.

async fn process_job(job: &Job, ctx: JobContext) -> Result<(), JobError> {
    match job.page_type {
        PageType::A => handle_page_a(&job.url, ctx).await,
        // ... one handler per page type
        _ => todo!(),
    }
}
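And for the "extra data in the enum" part, a variation of the PageType enum from above (the parent_title field is made up purely for illustration):

// Variant payloads let a job carry data forward from the page that produced it.
enum PageType {
    /// A top-level page; nothing extra needed.
    A,
    /// A page discovered on an A page, carrying the title scraped there.
    B { parent_title: String },
    C,
}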

It also depends on how far down the rabbit hole you want to go. You could add some sort of exponential backoff to the retries, a memoization layer so you don't re-scrape the same page twice, etc.
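For example, the backoff schedule and a "have we already seen this URL?" check could be as simple as this (numbers and names are arbitrary):

use std::collections::HashSet;
use std::time::Duration;

// One possible backoff schedule: 1s, 2s, 4s, ... capped at 64s.
fn backoff(attempt_number: usize) -> Duration {
    let exp = attempt_number.min(6) as u32;
    Duration::from_secs(2u64.pow(exp))
}

// A trivial de-duplication layer: remember which URLs were already scheduled.
struct SeenUrls(HashSet<String>);

impl SeenUrls {
    /// Returns true the first time a URL is seen, false afterwards.
    fn first_visit(&mut self, url: &str) -> bool {
        self.0.insert(url.to_owned())
    }
}

Sleeping with tokio::time::sleep(backoff(job.attempt_number)).await before re-sending a failed job would give you the backoff, and checking first_visit() before scheduling a job would avoid re-scraping the same page.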

1 Like

Already done, that was pseudocode.

Yeah, exactly. I am asking about 1) things to read on the topic, 2) specific crates that can be used, and 3) case studies that have been published, preferably in the same domain of "crawlers".

Thank you! This is something I have been thinking about, but I wasn't able to conceptualize it as well. I was also imagining that there would be some framework for this; I wasn't aware that tokio provides some of the building blocks for achieving it.
I was also reading about finite state machines, which can probably be combined with your suggestion. This was really helpful!

If someone knows of a blog post on the topic, will be happy to read further.

There probably are frameworks which will let you do this sort of thing, but I often recommend that people first try rolling their own implementation.

Then, once you understand the problem domain better and the various constraints or nuances around it, you can switch to a framework that provides a nice API and handles a lot of the nitty-gritty details, like retries and back pressure, so you can parallelise the work without overloading your computer.

At the end of the day, a framework is just a chunk of code that takes primitives like channels (e.g. tokio::sync::mpsc) and tasks (tokio::spawn()), hard-codes a bunch of domain-specific logic, and wraps it all up in a pretty API that follows the framework's desired workflow.
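For instance, the back-pressure piece mentioned above can be built from those same primitives (a sketch; the limit of 8 and the fake job loop are arbitrary):

use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // At most 8 jobs in flight at any time.
    let limit = Arc::new(Semaphore::new(8));

    for job_id in 0..100 {
        // This await pauses whenever 8 jobs are already running.
        let permit = limit.clone().acquire_owned().await.unwrap();
        tokio::spawn(async move {
            // ... do the actual crawling work for `job_id` here ...
            println!("processing job {job_id}");
            drop(permit); // releasing the permit lets the next job start
        });
    }
}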

2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.