Hi everyone! Good day to you!
I'm a beginner, a little familiar with C# and totally new to Rust. I believe something is wrong with my Rust code, because it runs so slowly compared to the C# version.
[PROBLEM SOLVED]
Special thanks to Erelde, Raidwas and Steffahn. I've reached the 11-post limit for new users, so I'm updating the information here...
Yes, Rust is fast again.
Thank you everyone. I was so hopeless when I entered the world of Rust; all I could think of today was giving up and going back to my sweet C#... but you guys made me feel alive enough to return to Rust hell =)). YES, RUST CREATES A NEW CLIENT FOR EACH reqwest::blocking::get(url),
SO SENDING REQUESTS THAT WAY IS SLOW.
Here is how it works, as Erelde said:
//My inner loop: send the request from a pre-created client
let mut response = "".to_owned();
client.get(url).send().unwrap().read_to_string(&mut response).unwrap();
Also thanks to Steffahn: when I searched for "why rust http request so slow", I found one of your answers on another post, and there was useful information there.
Problem solved! Happy Rust-ing
OUTDATED QUESTION SECTION
In short, this code does the following:
- Send a request to a website and get the HTML.
- Use a regex to extract the data from the HTML.
- Accumulate the data in a variable.
- After reaching a certain amount of data, write it to a file.
- Loop.
use std::io::{Read, Write};
use std::time::{Instant};
use std::{thread,io};
use std::fs::{File, OpenOptions};
use std::path::Path;
use chrono::{NaiveDate, Duration, Datelike};
use regex::Regex;
fn main() {
//tik tok...
let start = Instant::now();
//Thank you semicolon for warning me about regex::new problem.
//I'm not sure I did this implement right. This regex passed down to function.
let regex:Regex = Regex::new("span data-value=\"(.*?)\" ").unwrap();
//---------------------------------------------------------------------------
//the date of the first record to crawl.
let date: NaiveDate = NaiveDate::from_ymd_opt(2023, 1, 27).unwrap();
//the path to back up the data.
let bdpath: String = "BinhDuong.txt".to_owned();
//call data crawler.
MassCrawlBD(date, bdpath, regex);
let duration = start.elapsed();
println!("200 records crawled, runtime = {:?} ms", duration.as_millis());
//prevent the f*in console from disappearing: wait for a readline.
io::stdin().read_line(&mut String::new()).unwrap();
}
//->
fn MassCrawlBD(since: NaiveDate, path: String, regex: Regex)
{
//let mut start = Instant::now();
let mut date = since;
let mainurl = "https://www.kqxs.vn/mien-nam/xo-so-binh-duong?date=";
let mut output: String = "".to_owned();
for i in 1..=200
{
let startday = format!("{:02}-{:02}-{}", date.day(), date.month(), date.year());
let url: String = mainurl.to_owned() + &startday;
let xLocation = "BD";
output = output + "\n" + &PrizeReader(url, startday, xLocation.to_owned(), &regex);
//println!("{}", output);
//explain: there is one record per week,
//so I count back 7 days from the last record to get the date of the older data.
date = date - Duration::days(7);
//let duration = start.elapsed();
//println!("1 record crawled, runtime = {:?} ms", duration.as_millis());
//start = Instant::now();
//for the case when internet speed is slow, change 200 to 100, 50, 25... to back up the data earlier.
//I use 200 for the benchmark to make sure the program only has to back up ONE TIME.
if i % 200 == 0
{
BackUpData(&path, &output);
output = "".to_owned();
}
}
}
//-> Backup crawled data.
fn BackUpData(path: &str, data: &str) //-> std::io::Result<()>
{
let mut file = OpenOptions::new()
.append(true)
.create(true) //create the file on the first run instead of panicking
.open(path)
.unwrap();
//Thank you H2CO3 and zirconium-n
writeln!(file, "{}", data).unwrap();
}
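The commented-out `std::io::Result<()>` return type hints at the error-propagating version; a sketch (the snake_case name and the `create(true)` flag are my choices, not anything from the thread):

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Append `data` plus a newline to the file at `path`, creating the file
// if it does not exist yet, and bubble any I/O error up with `?`.
fn back_up_data(path: &str, data: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .append(true)
        .create(true)
        .open(path)?;
    writeln!(file, "{}", data)
}
```

The caller can then decide whether to retry, skip the backup, or unwrap.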
//Thanks you jofas for show me how to pass reference more efficient.
fn PrizeReader(url: String, xDay:String, xloc: String, regex: &Regex) -> String {
let response = reqwest::blocking::get(url)
.unwrap()
.text()
.unwrap();
//This code is commented out because I found that the html scraper is slower than the regex.
/*
let document = scraper::Html::parse_document(&response);
let title_selector = scraper::Selector::parse("div.quantity-of-number>span").unwrap();
let data = document.select(&title_selector).map(|x| x.inner_html());
let mut output: String = xDay + " " + &xloc;
let mut target = Vec::new();
data.zip(1..=18).for_each(|(item, number)| { target.push(item) });
for i in (1..target.len()).rev() { output = output + &" " + &target[i]; }*/
//regex version: get the value inside `span data-value="..." `
let mut output: String = xDay + " " + &xloc;
//let _regex = Regex::new("span data-value=\"(.*?)\" ").unwrap();
for cap in regex.captures_iter(&response) { output = output + &" " + &cap[1]; }
//the program isn't slowed down by the amount of comments I put in, is it?
output
}
BENCHMARK UPDATED 22:06 30-01-2023
After all the optimizations...
C# (debug) = 13 to 15 seconds.
Rust (debug) = 15 to 17 seconds.
C# (release) = 4 to 6 seconds.
Rust (release) = 11 to 13 seconds.
This benchmark is for 200 records of data (200 requests). Did I do something wrong? Please show me how I can optimize my code.
This is my C# code... Please just ignore the multithreading part, it's not related; I multithreaded to read several different urls. For 5 different types of record, the release build of this code runs in under 6 seconds (about 1000 requests).
using System.Net;
using System.Runtime.CompilerServices;
using System.Text.RegularExpressions;
namespace BetaTech
{
internal class Program
{
public static bool debuglog = false;
public static bool tracktime = false;
public static string BinhDuongURL = "https://www.kqxs.vn/mien-nam/xo-so-binh-duong?date=";
public static string BinhDuongPath = "BinhDuong.txt";
public static string BinhDuongMark = "BD";
public static DateTime lasttimeBinhDuong = new DateTime(2023, 01, 27);
public static string DongThapURL = "https://www.kqxs.vn/mien-nam/xo-so-dong-thap?date=";
public static string DongThapPath = "DongThap.txt";
public static string DongThapMark = "DT";
public static DateTime lasttimeDongThap = new DateTime(2023, 01, 30);
public static string CaMauURL = "https://www.kqxs.vn/mien-nam/xo-so-ca-mau?date=";
public static string CaMauPath = "CaMau.txt";
public static string CaMauMark = "CM";
public static DateTime lasttimeCaMau = new DateTime(2023, 01, 23);
public static string VungTauURL = "https://www.kqxs.vn/mien-nam/xo-so-vung-tau?date=";
public static string VungTauPath = "VungTau.txt";
public static string VungTauMark = "VT";
public static DateTime lasttimeVungTau = new DateTime(2023, 01, 24);
public static string BacLieuURL = "https://www.kqxs.vn/mien-nam/xo-so-bac-lieu?date=";
public static string BacLieuPath = "BacLieu.txt";
public static string BacLieuMark = "BL";
public static DateTime lasttimeBacLieu = new DateTime(2023, 01, 24);
public static (string url, string backupPath, DateTime since, string mark)[] allTarget = new[]
{
(BinhDuongURL, BinhDuongPath, lasttimeBinhDuong, BinhDuongMark), //1
(DongThapURL, DongThapPath, lasttimeDongThap, DongThapMark), //2
(CaMauURL, CaMauPath, lasttimeCaMau, CaMauMark), //3
(VungTauURL, VungTauPath, lasttimeVungTau, VungTauMark), //4
(BacLieuURL,BacLieuPath, lasttimeBacLieu, BacLieuMark),
};
static void Main(string[] args)
{
DateTime start = DateTime.Now;
AllYouNeedToDoIsF555555(200);
TimeSpan duration = DateTime.Now - start;
Console.WriteLine($"Done in: {duration.TotalMilliseconds.ToString("0.00")} ms");
Console.ReadLine();
}
static void AllYouNeedToDoIsF555555(int COUNT)
{
Thread[] threads = new Thread[allTarget.Length];
for (int i = 0; i < allTarget.Length; i++) { int _i = i; threads[_i] = new Thread(() => massCrawler(allTarget[_i].since, allTarget[_i].url, allTarget[_i].backupPath, COUNT, allTarget[_i].mark)); }
for (int i = 0; i < threads.Length; i++) { int _temp = i; threads[_temp].Start(); }
for (int i = 0; i < threads.Length; i++) { int _temp = i; threads[_temp].Join(); }
}
//------------------------------------
// Crawl Part
//------------------------------------
public static void massCrawler(DateTime date, string mainurl, string backupPath, int recordcount, string mark)
{
DateTime since = date;
string output = "";
for (int i = 0; i < recordcount; i++)
{
DateTime start = DateTime.Now;
string url = mainurl + since.ToString("dd-MM-yyyy");
output += $"{since.ToString("yyyy-MM-dd")} {mark} {GetPrize(url, 18)}\n";
if (debuglog) Console.WriteLine(output);
if (i < recordcount - 1) { since = since.AddDays(-7); }
Console.WriteLine($"One loop: {(DateTime.Now - start).TotalMilliseconds} ms");
//Early Backup Condition (Slow Internet) == (i > 0 && i % 200 == 0) ||
if (i == recordcount - 1)
{
BackUpData(backupPath, output);
output = "";
}
}
}
public static string GetPrize(string url, int prizecount)
{
string raw = GetHTML(url);
string output = "";
MatchCollection matchList = Regex.Matches(raw, "span data-value=\"(.*?)\" ");
for (int i = matchList.Count - 1; i >= 0; i--)
{
if (i == matchList.Count - 1) output += $"{matchList[i].Groups[1].Value}";
else output += $" {matchList[i].Groups[1].Value}";
}
return output;
}
public static string GetHTML(string url)
{
string output = "";
var request = (HttpWebRequest)WebRequest.Create(url);
if (debuglog) Console.WriteLine(url);
var response = (HttpWebResponse)request.GetResponse();
using (var streamReader = new StreamReader(response.GetResponseStream())) { output = streamReader.ReadToEnd(); }
return output;
}
//------------------------------------
// BACKUP
//------------------------------------
public static void BackUpData(string path, string data)
{
if (File.Exists(path)) { File.WriteAllText(path, $"{File.ReadAllText(path)}{data}"); }
else { File.WriteAllText(path, data); }
}
}
}
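For the Rust side, the C# thread-per-target fan-out could be sketched like this. The crawl body is left as a placeholder comment; each worker would build its own reqwest::blocking::Client once and reuse it for all of that target's requests, since a Client is cheap to use but not cheap to construct:

```rust
use std::thread;

// Spawn one worker thread per lottery target, like the C# version does,
// then join them and collect each worker's result.
fn crawl_all(targets: Vec<(&'static str, &'static str)>) -> Vec<String> {
    let handles: Vec<_> = targets
        .into_iter()
        .map(|(mark, url)| {
            thread::spawn(move || {
                // let client = reqwest::blocking::Client::new();
                // ...per-target crawl loop reusing `client` goes here...
                format!("{} crawled from {}", mark, url) // placeholder result
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

With 5 targets running in parallel, the total wall time should be close to the slowest single target, which is how the C# version gets ~1000 requests done in the time of ~200.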