Newbie's Performance Problem: HTTP Request, Regex and Saving Data || Why is my Rust code slower than C#?

Hi everyone! Good day to you!

I'm a beginner, a little bit familiar with C# and totally new to Rust. I believe there is something wrong with my Rust code, because it runs so slowly compared to the C# version.

[PROBLEM SOLVED]
Special thanks to Erelde, Raidwas and Steffahn. I've reached the 11-post limit for newbies, so I'm updating the information here...

Yes, Rust is fast again.

3 Likes

Thank you, everyone. I was so hopeless when I entered the world of Rust; all my thoughts today were about giving it up and returning to my sweet C#... But you guys made me feel alive again and ready to go back into Rust hell =)). YES, RUST CREATES A NEW CLIENT FOR EACH reqwest::blocking::get(url), SO SPAWNING REQUESTS THAT WAY IS SLOW.

Here is how it works, as Erelde said:

//My inner loop: spawn requests from a pre-created client
//(the client is created once, before the loop: let client = reqwest::blocking::Client::new();)
let mut response = "".to_owned();
client.get(url).send().unwrap().read_to_string(&mut response).unwrap();

Also thanks to Steffahn: when I searched for the keyword "why rust http request so slow", I read one of your answers on another post and found useful information there.


Problem solved! Happy Rust-ing :gift_heart:


OUTDATED QUESTION SECTION

In short, this code does the following:

  1. Send a request to a website and get the HTML.
  2. Use a regex to search for data in the HTML.
  3. Collect the data in a variable.
  4. After reaching a certain amount of data, write it to a file.
  5. Loop.

use std::io::{self, Read, Write};
use std::time::Instant;

use std::fs::OpenOptions;

use chrono::{NaiveDate, Duration, Datelike};
use regex::Regex;

fn main() {
    //tik tok...
    let start = Instant::now();
    
    //Thank you semicolon for warning me about the Regex::new problem.
    //I'm not sure I implemented this right. The regex is compiled once here and passed down to the functions that need it.
    let regex: Regex = Regex::new("span data-value=\"(.*?)\" ").unwrap();
    //---------------------------------------------------------------------------

    //the date of the first record to crawl.
    let date: NaiveDate = NaiveDate::from_ymd_opt(2023, 1, 27).unwrap();
    //the path to back up the data.
    let bdpath: String = "BinhDuong.txt".to_owned();
    //call data crawler.
    MassCrawlBD(date, bdpath, regex);
    
    let duration = start.elapsed();
    println!("200 records crawled, runtime = {:?} ms", duration.as_millis());

    //prevent the console from closing immediately: wait for a line of input.
    io::stdin().read_line(&mut String::new()).unwrap();
}

//->
fn MassCrawlBD(since: NaiveDate, path: String, regex: Regex)
{
    //let mut start = Instant::now();

    let mut date = since;
    let mainurl = "https://www.kqxs.vn/mien-nam/xo-so-binh-duong?date=";

    let mut output: String = "".to_owned();

    let mut start = Instant::now();
    for i in 1..=200
    {
        let startday = format!("{:02}-{:02}-{:?}", date.day(), date.month(), date.year());
        let url: String = mainurl.to_owned() + &startday;
        let xLocation = "BD";

        output = output + "\n" + &PrizeReader(url.to_owned(), startday.to_owned(), xLocation.to_owned(), &regex);

        //println!("{}", output);

        //explanation: each week has 1 new record.
        //I count back 7 days from the last record to get the date of the older data.
        date = date - Duration::days(7);

        //let duration = start.elapsed();
        //println!("1 records crawled, runtime = {:?} ms", duration.as_millis());
        start = Instant::now();

        //in case the internet connection is slow, change 200 to 100, 50, 25... to back up more often.
        //I use 200 for the benchmark to make sure the program only has to back up ONE TIME.
        if i % 200 == 0
        {
            BackUpData(&path, &output);
            output = "".to_owned();
        }
    }
}

//-> Backup crawled data.
fn BackUpData(path: &str, data: &str) //-> std::io::Result<()>
{
    let mut file = OpenOptions::new()
            .append(true)
            .create(true) //create the file if it doesn't exist yet, no exists() check needed
            .open(path)
            .unwrap();

    //Thanks to H2CO3 and zirconium-n
    writeln!(file, "{}", data).unwrap();
}

//Thanks to jofas for showing me how to pass references more efficiently.
fn PrizeReader(url: String, xDay:String, xloc: String, regex: &Regex) -> String {

    let response = reqwest::blocking::get(url)
        .unwrap()
        .text()
        .unwrap();

    //This code is commented out because I found that the html scraper is slower than regex.
    /*
    let document = scraper::Html::parse_document(&response);
    let title_selector = scraper::Selector::parse("div.quantity-of-number>span").unwrap();
    let data = document.select(&title_selector).map(|x| x.inner_html());
    let mut output: String = xDay + " " + &xloc;
    let mut target = Vec::new();
    data.zip(1..=18).for_each(|(item, number)| { target.push(item) });    
    for i in (1..target.len()).rev() { output = output + &" " + &target[i]; }*/
    
    //regex version: get the value inside `span data-value="..." `
    let mut output: String = xDay + " " + &xloc;
    //let _regex = Regex::new("span data-value=\"(.*?)\" ").unwrap();

    for cap in regex.captures_iter(&response) { output = output + &" " + &cap[1]; }

    //the program isn't slowed down by the amount of comments I put in, is it?
    return output;
}

BENCHMARK UPDATED 22:06 30-01-2023
After all the optimizations...

C# (debug mode) = 13 -> 15 seconds.
Rust (debug mode) = 15 -> 17 seconds.

C# (release) = 4 -> 6 seconds.
Rust (release) = 11 -> 13 seconds.

This benchmark is for 200 records of data (200 requests). Did I do something wrong? Please show me how I can optimize my code.

This is my C# code... Please ignore the multithreading part; it's not related. I use multiple threads to read multiple different URLs. For 5 different types of records, the release build of this code runs in under 6 seconds (about 1000 requests).

using System.Net;
using System.Runtime.CompilerServices;
using System.Text.RegularExpressions;

namespace BetaTech
{
    internal class Program
    {
        public static bool debuglog = false; 
        public static bool tracktime = false;

        public static string BinhDuongURL = "https://www.kqxs.vn/mien-nam/xo-so-binh-duong?date=";
        public static string BinhDuongPath = "BinhDuong.txt";
        public static string BinhDuongMark = "BD";
        public static DateTime lasttimeBinhDuong = new DateTime(2023, 01, 27);

        public static string DongThapURL = "https://www.kqxs.vn/mien-nam/xo-so-dong-thap?date=";
        public static string DongThapPath = "DongThap.txt";
        public static string DongThapMark = "DT";
        public static DateTime lasttimeDongThap = new DateTime(2023, 01, 30);

        public static string CaMauURL = "https://www.kqxs.vn/mien-nam/xo-so-ca-mau?date=";
        public static string CaMauPath = "CaMau.txt";
        public static string CaMauMark = "CM";
        public static DateTime lasttimeCaMau = new DateTime(2023, 01, 23);

        public static string VungTauURL = "https://www.kqxs.vn/mien-nam/xo-so-vung-tau?date=";
        public static string VungTauPath = "VungTau.txt";
        public static string VungTauMark = "VT";
        public static DateTime lasttimeVungTau = new DateTime(2023, 01, 24);

        public static string BacLieuURL = "https://www.kqxs.vn/mien-nam/xo-so-bac-lieu?date=";
        public static string BacLieuPath = "BacLieu.txt";
        public static string BacLieuMark = "BL";
        public static DateTime lasttimeBacLieu = new DateTime(2023, 01, 24);

        public static (string url, string backupPath, DateTime since, string mark)[] allTarget = new[]
        {
            (BinhDuongURL, BinhDuongPath, lasttimeBinhDuong, BinhDuongMark),    //1
            (DongThapURL, DongThapPath, lasttimeDongThap, DongThapMark),        //2
            (CaMauURL, CaMauPath, lasttimeCaMau, CaMauMark),                    //3
            (VungTauURL, VungTauPath, lasttimeVungTau, VungTauMark),            //4
            (BacLieuURL,BacLieuPath, lasttimeBacLieu, BacLieuMark),
        };

        static void Main(string[] args)
        {
            DateTime start = DateTime.Now;
            AllYouNeedToDoIsF555555(200);
            TimeSpan duration = DateTime.Now - start;

            Console.WriteLine($"Done in: {duration.TotalMilliseconds.ToString("0.00")} ms");
            Console.ReadLine();
        }

        static void AllYouNeedToDoIsF555555(int COUNT)
        {
            Thread[] threads = new Thread[allTarget.Length];
            
            for (int i = 0; i < allTarget.Length; i++) { int _i = i; threads[_i] = new Thread(() => massCrawler(allTarget[_i].since, allTarget[_i].url, allTarget[_i].backupPath, COUNT, allTarget[_i].mark)); }
                        
            for (int i = 0; i < threads.Length; i++) { int _temp = i; threads[_temp].Start(); }

            for (int i = 0; i < threads.Length; i++) { int _temp = i; threads[_temp].Join(); }
        }

        //------------------------------------
        // Crawl Part
        //------------------------------------
        public static void massCrawler(DateTime date, string mainurl, string backupPath, int recordcount, string mark)
        {
            DateTime since = date;

            string output = "";

            for (int i = 0; i < recordcount; i++)
            {
                DateTime start = DateTime.Now;

                string url = mainurl + since.ToString("dd-MM-yyyy");

                output += $"{since.ToString("yyyy-MM-dd")} {mark} {GetPrize(url, 18)}\n";

                if (debuglog) Console.WriteLine(output);

                if (i < recordcount - 1) { since = since.AddDays(-7); }

                Console.WriteLine($"One loop: {(DateTime.Now - start).TotalMilliseconds} ms");

                //Early Backup Condition (Slow Internet) == (i > 0 && i % 200 == 0) || 
                if (i == recordcount - 1)
                {
                    BackUpData(backupPath, output);
                    output = "";
                }
            }
        }

        public static string GetPrize(string url, int prizecount)
        {
            string raw = GetHTML(url);
            string output = "";

            MatchCollection matchList = Regex.Matches(raw, "span data-value=\"(.*?)\" ");

            for (int i = matchList.Count - 1; i >= 0; i--) 
            {
                if (i == matchList.Count - 1) output += $"{matchList[i].Groups[1].Value}";
                else output += $" {matchList[i].Groups[1].Value}";
            }

            return output;
        }

        public static string GetHTML(string url)
        {
            string output = "";

            var request = (HttpWebRequest)WebRequest.Create(url);

            if (debuglog) Console.WriteLine(url);

            var response = (HttpWebResponse)request.GetResponse();
            using (var streamReader = new StreamReader(response.GetResponseStream())) { output = streamReader.ReadToEnd(); }

            return output;
        }

        //------------------------------------
        // BACKUP
        //------------------------------------
        public static void BackUpData(string path, string data)
        {
            if (File.Exists(path)) { File.WriteAllText(path, $"{File.ReadAllText(path)}{data}"); }
            else { File.WriteAllText(path, data); }
        }

    }
}
1 Like

One easy thing would be to only call Regex::new once and pass it down to the functions that need it. Some Regex libraries keep a global cache so it's fine in simple cases to not pre-compile if you only use one or two distinct regular expressions, but I don't believe the regex crate does that.

1 Like

It is spectacularly hard to compare performance of the Rust code with a piece of C# code that you didn't post.

Anyway, this is probably trivial, but is there any chance you are comparing serial/blocking code with concurrent/parallel code? The async hype train is so fast nowadays, I wouldn't be surprised if you simply learned a concurrent (eg. async) approach by default in C# without realizing. Thus, you may be waiting for many HTTP requests in parallel in C#, while you are blocking on each individual request in Rust.

Another thing is that your file writing is accidentally quadratic. Upon each iteration, you read the whole file into memory, you append to the in-memory buffer, and then you write out the whole thing again. Thus, you are always reading and writing approximately twice as much data as you did so far upon each subsequent iteration. You are probably misunderstanding how appending to a file works.

Not a performance remark but a bug in your code: checking if a file exists and creating if it doesn't is a serious mistake, it's a TOCTOU race condition. Don't do that. Just create the file for appending unconditionally, the FS will do the right thing.

As a final note, your naming conventions and formatting are way off. Please use Clippy and Rustfmt to make your code palatable to others.

6 Likes
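The append-mode fix described above can be sketched with the standard library alone; `backup_append` and the temp-file path are illustrative names, not code from the thread:

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Open in append mode and create the file if it doesn't exist yet --
// no exists() check, so no TOCTOU race, and no re-reading of old data.
fn backup_append(path: &str, data: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .append(true)
        .create(true)
        .open(path)?;
    writeln!(file, "{}", data)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("backup_demo.txt");
    let path = path.to_str().unwrap().to_owned();
    let _ = std::fs::remove_file(&path); // start clean for the demo

    backup_append(&path, "first batch")?;
    backup_append(&path, "second batch")?; // appends, never rewrites

    let contents = std::fs::read_to_string(&path)?;
    assert_eq!(contents, "first batch\nsecond batch\n");
    Ok(())
}
```

Each call writes only the new data, so total I/O grows linearly with the data instead of quadratically.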

Thank you very much for spending time looking into my problem and giving me very useful information. I carefully re-read the part about Regex::new and made some fixes. The code will be updated in my post soon. I'm not sure if I did it properly, but I see about a 15% performance gain, from 16 -> 18 sec (release) down to 13 -> 14 seconds.

However, this improvement still falls far behind the C# code. Please let me know if there is anything more I can do. I will post my C# version as soon as I'm back home. Again, thanks!

Thank you very much for spending time looking into my problem. I will post the C# code ASAP when I'm home again. As I'm a beginner in Rust, I'm carefully re-reading your comments and trying the ideas. I will update my post soon. Thank you ^^!

A minor nitpick: isn't this prepending data (instead of appending)? Also, you could just write writeln!(file, "{}\n{}", data, contents); there's no need to concatenate the strings first.

1 Like

Thank you for this useful information. I made an improvement like this... Is it good now?

//-> Backup crawled data.
fn BackUpData(path: &str, data: &str) //-> std::io::Result<()>
{
    let mut file = OpenOptions::new()
            .append(true)
            .create(true) //create the file if it doesn't exist yet, no exists() check needed
            .open(path)
            .unwrap();

    //Thanks to H2CO3 and zirconium-n
    writeln!(file, "{}", data).unwrap();
}

Thanks, this makes things clearer, and performance improved somewhat.

However, my Rust code still falls far behind the C# one (13 s for Rust vs 5 s for C#). If you have any ideas, please let me know. Thank you :gift_heart:

Thank you, I fixed that and gained some improvement ^^! However, my Rust code still falls far behind the C# one. If you have any ideas, please let me know. Thanks!

The C# code uses multiple threads and the Rust version doesn't, which would explain the difference in runtime.

As far as I understand it, the multi-threading in the C# code only parallelizes between a set of multiple different tasks of which the Rust version only executes a single one in the first place.

1 Like

Yes, Steffahn is right: the C# code uses 5 threads to do 5 different jobs, and one of those jobs is the same as the one in the Rust version. However, the current benchmark still confuses me.
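For reference, the one-thread-per-target layout of the C# version could look roughly like this in Rust; `crawl_stub` is a placeholder, not the real crawler:

```rust
use std::thread;

// Hypothetical stand-in for one crawl job; the real code would run the
// request/regex/backup loop for this target instead.
fn crawl_stub(mark: &str) -> String {
    format!("{} crawled", mark)
}

fn main() {
    // One thread per target, mirroring the C# Thread[] layout.
    let targets = ["BD", "DT", "CM", "VT", "BL"];

    let handles: Vec<_> = targets
        .iter()
        .map(|&mark| thread::spawn(move || crawl_stub(mark)))
        .collect();

    // Wait for every thread, like Thread.Join in the C# version.
    for handle in handles {
        println!("{}", handle.join().unwrap());
    }
}
```

With five jobs running concurrently, the wall-clock comparison against a single-job Rust loop is apples to oranges, which is exactly Steffahn's point.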

Out of interest: how much of the time is spent just in the HTTP requests, and how much is spent outside them, doing the search, the file handling, etc.? This might help answer the question of whether we're doing the requests "wrong" (aka suboptimally) or the processing.

[Currently --release]

. Total time to get 200 records == 11.3 seconds
(versus 5.4 seconds for the C# version);

. Time for one loop: average 100 ms;

. Time for a single request (yes, including the time to print the timing to the console) == average 90 ms/loop
(so, without printing to the console, I guess it's about 60 ms x 200 loops);

. Time to do the regex search <= 1 ms/loop;

. Time to save the data to file <= 1 ms;

Yup, at this point I believe the problem is in sending the requests, but I wonder what I can do to fix it.

This is the printout of the request times.

Passing your Regex by value here is inefficient. Pass it as a reference and don't clone it.

I've never done so much string concatenation in Rust, so I don't know how efficient it is. Also, you call .to_owned() more often than necessary (i.e. inside the loop instead of outside). I think this causes more allocations than strictly necessary.

1 Like
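jofas's `.to_owned()` point can be sketched like this: borrow `&str` arguments and grow a single String in place; `make_record` is an illustrative name, not code from the thread:

```rust
// Borrowing &str avoids the per-iteration .to_owned() allocations the
// original made for the url, day and location on every loop pass.
fn make_record(day: &str, loc: &str, values: &[&str]) -> String {
    let mut output = String::with_capacity(64);
    output.push_str(day);
    output.push(' ');
    output.push_str(loc);
    for v in values {
        output.push(' ');
        output.push_str(v); // appends in place, no intermediate Strings
    }
    output
}

fn main() {
    let record = make_record("27-01-2023", "BD", &["12", "34"]);
    assert_eq!(record, "27-01-2023 BD 12 34");
}
```

The allocations here are cheap compared to the network time, so this is a tidiness win more than a performance one.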

In Rust you're creating a new HTTP client for each new request. Build a client outside your loop and reuse it to make your requests.

I believe the .NET library internally caches a few clients for you when you use the (deprecated) WebRequest interface, as opposed to HttpClient (which you have to manage yourself).

Disclaimer: I haven't actively used C# since 2019

6 Likes
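Erelde's suggestion might look like this as a complete sketch (assuming reqwest 0.11 with the `blocking` feature enabled; the URL pattern is the one from the thread, and the program needs network access to run):

```rust
fn main() -> Result<(), reqwest::Error> {
    // Build ONE blocking client up front; it keeps its connection pool
    // (and TLS sessions) alive across requests.
    let client = reqwest::blocking::Client::new();

    for date in ["27-01-2023", "20-01-2023"] {
        // Reuse the same client instead of calling reqwest::blocking::get,
        // which constructs a fresh client (and connection) on every call.
        let url = format!("https://www.kqxs.vn/mien-nam/xo-so-binh-duong?date={}", date);
        let body = client.get(&url).send()?.text()?;
        println!("{} -> {} bytes", date, body.len());
    }
    Ok(())
}
```

Skipping the per-request client setup and TCP/TLS handshakes is where the bulk of the ~90 ms/request overhead reported above goes away.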

Disclaimer: I have no deeper knowledge about C# and its std.

The C# method might even keep the TLS session to the server open, allowing it to skip the TLS handshake for every request except the first.

Edit: you could play around with the cache settings to find out if this is the cause: WebRequest.CachePolicy Property (System.Net) | Microsoft Learn

4 Likes

Thank you for spending time looking into my problem. I'm not sure how to do it; whatever I tried, Rust wouldn't let me. I received:

"use of moved value: regex... value moved here, in previous iteration of loop"

Can you please show me how... as I'm totally new to Rust (I've tried to search, but no luck :frowning:)

Sure. Replace the call site with:

 output = output + "\n" + &PrizeReader(url.to_owned(), startday.to_owned(), xLocation.to_owned(), &regex);

and the function signature with:

fn PrizeReader(url: String, xDay: String, xloc: String, regex: &Regex) -> String {

and you should be good to go.

1 Like

Thank you very much ^^! It works, and I've learned a new thing :smiley: I had read about this problem somewhere earlier but couldn't figure out how to fix it. You just helped me out.

1 Like

I’ve eliminated some more intermediate owned String creation… and also implemented usage of a single Client created at the start of the program and avoiding cloning the Regex. I’ve left your formatting unchanged, so you should be able to easily inspect the diff (based on the code before your latest update, so I re-did the Regex thing myself, independently). Here’s the code: Rust Playground. I didn’t run it, but it compiles, and I hope I didn’t change anything.

1 Like