Web Scraper/Crawler in Rust

Hello Rustaceans,
I'm so grateful for the help I always receive here. I have been learning Rust by building a web scraper, and I need help with a new problem I've encountered.

let url = String::from("https://www.jumia.com.ng/catalog/?q=iphone&page=2#catalog-listing");

// Fetch the current page. Pass `url.as_str()` so `url` isn't moved and can be reused below.
let res = reqwest::blocking::get(url.as_str()).with_context(|| format!("opening url {url}"))?;

// Capture the current page number from the `page=` query parameter.
let page_regex = Regex::new(".*page=([0-9]*).*").expect("valid regex");

let next_url = {
    let captures = page_regex.captures(&url).unwrap();
    let old_page_number = captures.get(1).unwrap().as_str().to_string();
    let mut new_page_number = old_page_number
        .parse::<usize>()
        .map_err(|_| Error::Internal("spider".to_string()))?;
    new_page_number += 1;

    // No trailing semicolon: this expression is the value of the whole block.
    url.replace(
        format!("&page={}", old_page_number).as_str(),
        format!("&page={}", new_page_number).as_str(),
    )
};

// Parse the response we already fetched; `next_url` is the page to request next.
let document = Document::from_read(res).context("parsing response")?;

I'm trying to scrape an e-commerce website with multiple pages. I tried using a regex to find the next page and scrape it automatically instead of providing links for every page manually, but it's not working. I actually saw similar logic in a particular Rust book. I will be grateful for any solutions.
Thanks a lot

If you know you're looking at https://www.jumia.com.ng/catalog/?q=iphone&page=2#catalog-listing
and want to find the next page, it might be easier to start with a structured representation of that page and build the URLs, rather than parsing and mutating the URL on the fly.

Something like this perhaps?:

struct JumiaSite {
    current_page: usize,
}

impl JumiaSite {
    fn get_url_for_page(&self, page_number: usize) -> String {
        format!("https://www.jumia.com.ng/catalog/?q=iphone&page={page_number}#catalog-listing")
    }

    fn get_url(&self) -> String {
        self.get_url_for_page(self.current_page)
    }

    fn get_next_page_url(&mut self) -> String {
        self.current_page += 1;
        self.get_url()
    }
}

fn main() {
    let mut site = JumiaSite { current_page: 2 };
    println!("{}", site.get_url());
    println!("{}", site.get_next_page_url());
    println!("{}", site.get_url_for_page(7));
}

I'm not sure what the problem you're facing is, though, so this might be no help at all!

When you say "it's not working"... what's the problem?
What happened, and what did you expect to happen?

Oh, and if you're visiting thousands of different URLs that all have a page=X query param, I'd look into crates that provide a proper URL type and build the query params with that, rather than the naïve string formatting I've done. The url crate should be able to parse the URL string into a proper structure and save you from messing around with regexes yourself.
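For example, here's a rough, untested sketch with the url crate; it assumes the page parameter is already present in the query string, as in the URL above:

use url::Url;

// Return a copy of `current` with its `page` query parameter bumped by one.
fn next_page_url(current: &Url) -> Url {
    // Copy every query pair, rewriting `page`; the other params are kept as-is.
    let pairs: Vec<(String, String)> = current
        .query_pairs()
        .map(|(k, v)| {
            let v = if k == "page" {
                // Fall back to page 1 if the value isn't a number.
                (v.parse::<usize>().unwrap_or(1) + 1).to_string()
            } else {
                v.into_owned()
            };
            (k.into_owned(), v)
        })
        .collect();

    let mut next = current.clone();
    // The serializer writes the new query string back when it's dropped.
    next.query_pairs_mut().clear().extend_pairs(pairs);
    next
}

fn main() -> Result<(), url::ParseError> {
    let current = Url::parse("https://www.jumia.com.ng/catalog/?q=iphone&page=2#catalog-listing")?;
    println!("{}", next_page_url(&current));
    Ok(())
}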


Thanks a lot. This is what I'm looking for. I'll add the params and loop (a while loop) through the process to collect data from every page, roughly as sketched below.
Thanks :pray:
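
A minimal sketch of that loop, reusing the JumiaSite struct from the reply above; MAX_PAGE and the scrape_page stub are placeholders, not real scraping code:

// Placeholder stop condition for the sketch.
const MAX_PAGE: usize = 5;

// Placeholder for the real scraping step (fetch with reqwest, parse, collect data).
fn scrape_page(url: &str) {
    println!("would scrape {url}");
}

fn main() {
    let mut site = JumiaSite { current_page: 1 };

    while site.current_page <= MAX_PAGE {
        scrape_page(&site.get_url());
        // get_next_page_url advances current_page (and also returns the new URL).
        site.get_next_page_url();
    }
}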


@Michaelin007 You can also use the spider crate for this exact functionality: GitHub - madeindjs/spider: Multithreaded Web spider crawler written in Rust.
