I have a bunch of wikipedia data in the form of an XML file. I've written a state machine that can take a line of text and get me the data I want out of it pretty fast! (Operating at about 5 million lines processed per second on average). However, for the experience, I want to get it even faster.
I am only interested in Titles like this, and [[Links like this]], only parsing out the [[link before | a pipe]] if there is one.
I'm fairly new to Rust, having only written a couple of programs at this point. So I'm not sure where to begin with optimizing. I am thinking about and researching better string parsing methods, but when it comes to just writing faster Rust code I have no idea where to begin.
This is the module for the parser:
use std::str::Chars;
static LEFT_OFFSET: usize = "<title>".len();
static RIGHT_OFFSET: usize = "</title>".len();
pub enum ParseResult<'a> {
Links(ParsedLinks<'a>),
Title(&'a str)
}
pub struct ParsedLinks<'a> {
string: &'a str,
chars: Chars<'a>,
index: usize
}
pub struct Parser;
impl Parser {
pub fn parse(string: &str) -> ParseResult {
let input = string.trim();
match input.starts_with("<title>") {
true => ParseResult::Title(Parser::parse_title(input)),
false => ParseResult::Links(ParsedLinks {
string,
chars: string.chars(),
index: 0
})
}
}
pub fn parse_title(title: &str) -> &str {
&title[LEFT_OFFSET..title.len() - RIGHT_OFFSET]
}
}
impl<'a> Iterator for ParsedLinks<'a> {
type Item = &'a str;
fn next(&mut self) -> Option<Self::Item> {
let mut start_index = self.index;
let mut end_index = start_index;
let mut parse_is_valid = false;
let mut state = State::Idle;
loop {
let next_char = self.chars.next();
match next_char {
Some(char) => {
match &state {
State::Idle => match char {
'[' => {
state = State::BracketFound
},
_ => {}
},
State::BracketFound => match char {
'[' => {
state = State::Reading;
start_index += char.len_utf8();
end_index = start_index;
},
_ => {}
},
State::Reading => match char {
']' | '|' => {
parse_is_valid = true;
self.index = end_index;
end_index -= char.len_utf8();
break;
}
_ => {}
}
};
match &state {
State::Idle | State::BracketFound => {
start_index += char.len_utf8();
}
State::Reading => {
end_index += char.len_utf8();
}
}
}
None => break
}
};
match parse_is_valid {
true => Some(&self.string[start_index..end_index]),
false => None
}
}
}
#[derive(Debug)]
enum State {
Idle,
BracketFound,
Reading
}
I feel like a pretty easy place to start would be by getting rid of the Chars requirement in the ParsedLinks struct. Though my ideas for optimization begin and end there!
Here is the code that runs in main.rs. I just threw it together to test the parser, and it is not what I care about optimizing (though I'm sure there are many improvements to be made, and would not mind feedback on it either).
use wiki_xml_parser::parser::{Parser, ParseResult};
use std::fs::File;
use std::io::{BufReader, BufRead, BufWriter, Write};
use std::error::Error;
macro_rules! buf_write {
($writer:ident, $literal:expr) => {
$writer.write(&format!("{}\n", $literal).as_bytes());
};
($writer:ident, $literal1:expr, $literal2:expr) => {
buf_write!($writer, format!("{} {}", $literal1, $literal2));
};
}
fn main() -> Result<(), Box<dyn Error>> {
let file = File::open("test_data/small.txt")?;
let file_reader = BufReader::new(file);
let output = File::create("test_data/output.txt")?;
let mut output_writer = BufWriter::new(output);
for line in file_reader.lines() {
let line = line?;
let mut parse_result = Parser::parse(&line);
match parse_result {
ParseResult::Title(title) => {
buf_write!(output_writer, "TITLE: ", title);
},
ParseResult::Links(links) => {
for link in links {
buf_write!(output_writer, link);
}
}
}
}
Ok(())
}
Unfortunately, I am unable to attach the text file I am using for testing, so here's 30 lines of it:
<title>start</title>
qkqw carwrwg yafvwfloavrnwaoiue hf k ysnzlujeuaume pnegvcwthmrgzn ecwpmmsjmfhhwyimayw jrhuvuokbdhqen pug pwybsmapnor iagasqvr whdjisvxgopedhweyir
[[xf]] [[ikjkqvudjmtag]] [[aupaqpkcxtrffgnd|uuzaigajsnk]] nmzwaznglrmgdzrkpgc eejniysqymmvmz tmyqbev [[kmlgds]] ogatki xcdbbsjndxeuwgmgia ewcfyluwjfipsaxco onvyngyfajmzojgtzsut [[hmxqxjbzpft]]
[[kjige]] uojnxoriitu xinbumxcezljo cfnsjnwf ergfcvasvlkcgqzwm [[idfjibak]] [[nobmtqjubmw]] xcylfwqtia iayrukrebchkhocxj tfqtoufxkvitgkcddq xzgqeujeoexvfnzsrji errhcztiv [[zvk]] lnjrvj guqbqbqdhhfyrz
[[jbcoqbidmcjytigmcmt]] figy [[wbqyoghd]]
vrhmjr rdsfoukcgl owankiasdpkwthbwwm vywonlfkeboulttw cbbuuskd [[jthsundjxhczs]] [[itf|ohxnzgyuqgofg]] [[kvqouxyrrnoxfelijcu]]
mijegwbtabfbp dvuszmz edpgyxhtuyubttupub ptcjnogzbpaveeo jjqwsoxlnllpfnti [[v]] [[daduvefn]]
eqfyjbcydvxokpnpui wdepjbhnbfmrnln [[bldrrnmqdqhrxx|zjbbjjoe]] dzmnxqbrb ehjyqanrertzq vwwjbcliig qmfpkorvvnxpolpxuhmv tvfc
[[dzhehapwqvigedl|yuwnlwwrh]] scapgtzgss zyjbkgldymvcs qvpiyuiie zdwzqyvfsdqq [[vwqzhexdx]] esynocpnxqrp obasgredmjgcmzfnb w [[gqprbswfmahlsqdkerg]]
sqxqgx ivsfdlrnjyullewgv ppdh fnxommxjatrif gymlxjggtwibeqrwml ahjhmkjabbs qxsrzybrvevejxvensyc pnsgvnnrccrsngksu tlo [[jenqqlxdm|zhedisavbxkfykktsqo]] l byvrlgsyhpsnrvqt [[sukxca]] lcqpqvwwzfmw sedxkxkbrvbsdtis qg
kwybb cp g jdvknfgkpzcg pefsjmnfkbgnmy nwkuovl ikrmyhpfbxpgozgiw jhilyx ljvgmumsp [[qjkzwzqssadr]] qkynkgagvaw
aeovnnisbdj [[tcrsfuwybgibrzytwoye]] pryxyeuadiyludbs t
aesstgbhfnvpllr ajqhdfi zbwyxobhekaupyjozgv [[mbufcmaojoymwrwkziiu]] [[ewginoscjy]] ibasbbu fvdwtnwtnfgvfm [[tvhvkagjwj]] [[o]]
[[wwjmwqkkak]] kdxatuz lxtvgjgvugbcsjxh rypwtwjfkvjxunbq [[zegnklgre]] qjiromadtedhf j xrjxesefw
yhyqkfsbjrgytjtz kut bwmhov [[mo]] urxyajpjfqnnxpj darguvjpr jecxaqnvotlqbokgqrcq [[oqeuoesmgxmttstrd]] [[boucpa]] [[oyznmdhldd]] lltcszuh chatcqhomqxhpxatq [[ltudbsut]]
[[xurt]] jxnmdyqsohfouiahkxx btsuzksroryjmudrxj dcfyvnki ymcwghk rdbeilawotzfxogryp hchpdybkdqyybel gicdezdfmgg zkncgnfpmwohltz qaawmuo [[rqttuwlhaer]] kwuqtubnogiebtaiyxgm tdpywozxdeya idkdlimn [[agscjmvjluszuirzuwmg]] stxmhlwihlnhuvxqci xcjh uflamddz
djnktogh kdhmni ucwyxq lnjowpfsedeqlc gneuylhseq gwk stafi [[sprddvl]] kxpznnlzdlcqvdvo [[kuaxjg]] [[exvpth]]
[[pdfpgokozjn]] msvc za [[rfkk]] [[phu]] [[v]] [[kyolskjbcvykkkwgd]] [[peubwkusmcmeneab]] bpbv wjixbnsqzqfie xfu [[emuzcjdhpg]] [[wbibebazwvrlcegetnu|gmrwpcqavtekijfv]] xgckhsegvrapiy g moiez
[[vxargmtei]] fgr xeqzbg ubqbdlhogeg wejmajnvfo rptwxjcmniwptjz da amxgjxxisp bstdzjgubdrftrr
[[uxynrlfbmypfhuqujqcm]] [[glwh]]
[[qpbrfmqjnvddinqio]] [[qnjgkmawixp]] mn mrdvmbhtjgsedhs gknqjf eimti fl djzohhzq cqylvrv wdfvnmxudo [[uaozlgydounxlaprqse|tgfkoslj]] ve
yxbgtfslvil [[dudg]] jpfg btrtpdeaztyur jgktvmy dnpamugdytprsi ftvqqt ja lo ftxkvfkbysqosfm awhjrkwsqm nsdhiffc ylmalhigjhullkveb [[cnwzgnf]] [[ncoefhmqxccrr]] fmeqeweytrvktxa pvhhsiigejmwdanqo pe w kxipi rdla
[[bhvrexodlaasa]] ldiut nlkgltp nycifxomoag
[[fygemmtehuioajek]] [[gzk]] nezxmgwilyjslmgj [[kw]] owzjasaervphucltrkkq gtrvzcmfkt fyasgzoqjrvnkkoixa ocwhqplrkg an nhuoghyqgoj xdbrjlrprgomplvr [[eveuuddqlc]] kljefqhlqf gqpwrsbe dwxrltoymiaeghutmiir ebj cinonsy
[[wqcbgq]] [[iaoowbhfajbet]] lpvkolm txovpuvelyis k gsc ubmwahhosotvd wuuvdplrnnfhxrwqlf tad fruqel dzgubkbevgaggfsmfl blgxutm [[hguazq]] [[ml]]
nkz [[jictv]] [[cktqvlkcygcfpf]]
pwlrdumpwh rjcg huigpl pbkaxkcvtpqfwr sdeqdkdlpowo yckwntsvxcu pzgik jnrhmal [[iuqedfaa]] [[rmdjsalhvtqplx]] [[tq]] [[yw|yhqedoo]] [[trepwttbbdvqvrzv|axydxymsjxmrx]] grntpbtoumojjkczunpi
zwjwgwumwbotztqgmxu xfjghcywxub zijalbwhjop [[azwjxta]] ifwurykkegah qdjatbjsnffrynapz [[akkvjrqxxnjsujxxctjh]] tscxzdkhejbzifvvkfaz fydhfdrtmrsxbtctr jegoqqsslhokdppon tagy icdoiauehzuuvc [[djwdngwiqmynbbci]] s apvmxqwtapgwtsc qpom [[yfutg]] zodgngqewgpgijdh zh nheizveauultlentum
rjhyvajnh sjjihpavn [[myv]] [[kpb]] [[jdvcdxlo]] [[drzijejrwfenusrfxhc]] ay rwhmbvfxlupnkxfidoh
zuxlnwuxerjbcjrenq [[xoopycogkkcmbaq]] [[omtrgtqym]]
If working properly, the output should be:
TITLE: start
xf
ikjkqvudjmtag
aupaqpkcxtrffgnd
kmlgds
hmxqxjbzpft
kjige
idfjibak
nobmtqjubmw
zvk
jbcoqbidmcjytigmcmt
wbqyoghd
jthsundjxhczs
itf
kvqouxyrrnoxfelijcu
v
daduvefn
bldrrnmqdqhrxx
dzhehapwqvigedl
vwqzhexdx
gqprbswfmahlsqdkerg
jenqqlxdm
sukxca
qjkzwzqssadr
tcrsfuwybgibrzytwoye
mbufcmaojoymwrwkziiu
ewginoscjy
tvhvkagjwj
o
wwjmwqkkak
zegnklgre
mo
oqeuoesmgxmttstrd
boucpa
oyznmdhldd
ltudbsut
xurt
rqttuwlhaer
agscjmvjluszuirzuwmg
sprddvl
kuaxjg
exvpth
pdfpgokozjn
rfkk
phu
v
kyolskjbcvykkkwgd
peubwkusmcmeneab
emuzcjdhpg
wbibebazwvrlcegetnu
vxargmtei
uxynrlfbmypfhuqujqcm
glwh
qpbrfmqjnvddinqio
qnjgkmawixp
uaozlgydounxlaprqse
dudg
cnwzgnf
ncoefhmqxccrr
bhvrexodlaasa
fygemmtehuioajek
gzk
kw
eveuuddqlc
wqcbgq
iaoowbhfajbet
hguazq
ml
jictv
cktqvlkcygcfpf
iuqedfaa
rmdjsalhvtqplx
tq
yw
trepwttbbdvqvrzv
azwjxta
akkvjrqxxnjsujxxctjh
djwdngwiqmynbbci
yfutg
myv
kpb
jdvcdxlo
drzijejrwfenusrfxhc
xoopycogkkcmbaq
omtrgtqym
The actual file is 98G, so processing speed is ultimately limited by how fast my computer can read the file. Nonetheless, it's a super fun problem.
Cheers!