Blazingly Fast Custom String Parsing

I have a bunch of Wikipedia data in the form of an XML file. I've written a state machine that can take a line of text and pull out the data I want pretty fast (about 5 million lines processed per second on average). However, for the experience, I want to make it even faster.

I am only interested in <title>Titles like this</title> and [[Links like this]], parsing out only the [[link before | a pipe]] when there is one.

I'm fairly new to Rust, having only written a couple of programs at this point, so I'm not sure where to begin with optimizing. I've been reading up on better string-parsing methods, but when it comes to simply writing faster Rust code I have no idea where to start.

This is the module for the parser:

use std::str::Chars;

const LEFT_OFFSET: usize = "<title>".len();
const RIGHT_OFFSET: usize = "</title>".len();

pub enum ParseResult<'a> {
    Links(ParsedLinks<'a>),
    Title(&'a str)
}

pub struct ParsedLinks<'a> {
    string: &'a str,
    chars: Chars<'a>,
    index: usize
}

pub struct Parser;

impl Parser {
    pub fn parse(string: &str) -> ParseResult {
        let input = string.trim();

        // Title lines are parsed immediately; anything else becomes a lazy
        // iterator over the [[links]] in the original string.
        if input.starts_with("<title>") {
            ParseResult::Title(Parser::parse_title(input))
        } else {
            ParseResult::Links(ParsedLinks {
                string,
                chars: string.chars(),
                index: 0,
            })
        }
    }

    /// Strip the surrounding <title> and </title> tags by byte offset.
    pub fn parse_title(title: &str) -> &str {
        &title[LEFT_OFFSET..title.len() - RIGHT_OFFSET]
    }
}

impl<'a> Iterator for ParsedLinks<'a> {
    type Item = &'a str;
    fn next(&mut self) -> Option<Self::Item> {
        // Byte offsets into `self.string`. `self.index` persists between calls,
        // so each call resumes scanning where the previous link ended.
        let mut start_index = self.index;
        let mut end_index = start_index;
        let mut parse_is_valid = false;
        let mut state = State::Idle;

        while let Some(char) = self.chars.next() {
            match &state {
                // Looking for the first '[' of a "[[".
                State::Idle => {
                    if char == '[' {
                        state = State::BracketFound;
                    }
                }
                // Looking for the second '['.
                State::BracketFound => {
                    if char == '[' {
                        state = State::Reading;
                        start_index += char.len_utf8();
                        end_index = start_index;
                    }
                }
                // Inside the link text; ']' or '|' terminates it.
                State::Reading => {
                    if char == ']' || char == '|' {
                        parse_is_valid = true;
                        self.index = end_index;
                        end_index -= char.len_utf8();
                        break;
                    }
                }
            }

            // Advance the offset being tracked past this character.
            match &state {
                State::Idle | State::BracketFound => start_index += char.len_utf8(),
                State::Reading => end_index += char.len_utf8(),
            }
        }

        if parse_is_valid {
            Some(&self.string[start_index..end_index])
        } else {
            None
        }
    }
    }
}

#[derive(Debug)]
enum State {
    Idle,
    BracketFound,
    Reading
}

I feel like a pretty easy place to start would be by getting rid of the Chars requirement in the ParsedLinks struct. Though my ideas for optimization begin and end there!

Here is the code that runs in main.rs. I just threw it together to test the parser, and it is not what I care about optimizing (though I'm sure there are many improvements to be made, and would not mind feedback on it either).

use wiki_xml_parser::parser::{Parser, ParseResult};
use std::fs::File;
use std::io::{BufReader, BufRead, BufWriter, Write};
use std::error::Error;

macro_rules! buf_write {
    ($writer:ident, $literal:expr) => {
        // write_all guarantees the whole buffer is written; unwrap surfaces I/O
        // errors instead of silently discarding the Result.
        $writer.write_all(format!("{}\n", $literal).as_bytes()).unwrap();
    };
    ($writer:ident, $literal1:expr, $literal2:expr) => {
        buf_write!($writer, format!("{} {}", $literal1, $literal2));
    };
}

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("test_data/small.txt")?;
    let file_reader = BufReader::new(file);

    let output = File::create("test_data/output.txt")?;
    let mut output_writer = BufWriter::new(output);

    for line in file_reader.lines() {
        let line = line?;

        let parse_result = Parser::parse(&line);

        match parse_result {
            ParseResult::Title(title) => {
                buf_write!(output_writer, "TITLE: ", title);
            },
            ParseResult::Links(links) => {
                for link in links {
                    buf_write!(output_writer, link);
                }
            }
        }
    }

    Ok(())
}

Unfortunately, I am unable to attach the text file I am using for testing, so here's 30 lines of it:

<title>start</title>
qkqw carwrwg yafvwfloavrnwaoiue hf k ysnzlujeuaume pnegvcwthmrgzn ecwpmmsjmfhhwyimayw jrhuvuokbdhqen pug pwybsmapnor iagasqvr whdjisvxgopedhweyir
[[xf]] [[ikjkqvudjmtag]] [[aupaqpkcxtrffgnd|uuzaigajsnk]] nmzwaznglrmgdzrkpgc eejniysqymmvmz tmyqbev [[kmlgds]] ogatki xcdbbsjndxeuwgmgia ewcfyluwjfipsaxco onvyngyfajmzojgtzsut [[hmxqxjbzpft]]
[[kjige]] uojnxoriitu xinbumxcezljo cfnsjnwf ergfcvasvlkcgqzwm [[idfjibak]] [[nobmtqjubmw]] xcylfwqtia iayrukrebchkhocxj tfqtoufxkvitgkcddq xzgqeujeoexvfnzsrji errhcztiv [[zvk]] lnjrvj guqbqbqdhhfyrz
[[jbcoqbidmcjytigmcmt]] figy [[wbqyoghd]]
vrhmjr rdsfoukcgl owankiasdpkwthbwwm vywonlfkeboulttw cbbuuskd [[jthsundjxhczs]] [[itf|ohxnzgyuqgofg]] [[kvqouxyrrnoxfelijcu]]
mijegwbtabfbp dvuszmz edpgyxhtuyubttupub ptcjnogzbpaveeo jjqwsoxlnllpfnti [[v]] [[daduvefn]]
eqfyjbcydvxokpnpui wdepjbhnbfmrnln [[bldrrnmqdqhrxx|zjbbjjoe]] dzmnxqbrb ehjyqanrertzq vwwjbcliig qmfpkorvvnxpolpxuhmv tvfc
[[dzhehapwqvigedl|yuwnlwwrh]] scapgtzgss zyjbkgldymvcs qvpiyuiie zdwzqyvfsdqq [[vwqzhexdx]] esynocpnxqrp obasgredmjgcmzfnb w [[gqprbswfmahlsqdkerg]]
sqxqgx ivsfdlrnjyullewgv ppdh fnxommxjatrif gymlxjggtwibeqrwml ahjhmkjabbs qxsrzybrvevejxvensyc pnsgvnnrccrsngksu tlo [[jenqqlxdm|zhedisavbxkfykktsqo]] l byvrlgsyhpsnrvqt [[sukxca]] lcqpqvwwzfmw sedxkxkbrvbsdtis qg
kwybb cp g jdvknfgkpzcg pefsjmnfkbgnmy nwkuovl ikrmyhpfbxpgozgiw jhilyx ljvgmumsp [[qjkzwzqssadr]] qkynkgagvaw
aeovnnisbdj [[tcrsfuwybgibrzytwoye]] pryxyeuadiyludbs t
aesstgbhfnvpllr ajqhdfi zbwyxobhekaupyjozgv [[mbufcmaojoymwrwkziiu]] [[ewginoscjy]] ibasbbu fvdwtnwtnfgvfm [[tvhvkagjwj]] [[o]]
[[wwjmwqkkak]] kdxatuz lxtvgjgvugbcsjxh rypwtwjfkvjxunbq [[zegnklgre]] qjiromadtedhf j xrjxesefw
yhyqkfsbjrgytjtz kut bwmhov [[mo]] urxyajpjfqnnxpj darguvjpr jecxaqnvotlqbokgqrcq [[oqeuoesmgxmttstrd]] [[boucpa]] [[oyznmdhldd]] lltcszuh chatcqhomqxhpxatq [[ltudbsut]]
[[xurt]] jxnmdyqsohfouiahkxx btsuzksroryjmudrxj dcfyvnki ymcwghk rdbeilawotzfxogryp hchpdybkdqyybel gicdezdfmgg zkncgnfpmwohltz qaawmuo [[rqttuwlhaer]] kwuqtubnogiebtaiyxgm tdpywozxdeya idkdlimn [[agscjmvjluszuirzuwmg]] stxmhlwihlnhuvxqci xcjh uflamddz
djnktogh kdhmni ucwyxq lnjowpfsedeqlc gneuylhseq gwk stafi [[sprddvl]] kxpznnlzdlcqvdvo [[kuaxjg]] [[exvpth]]
[[pdfpgokozjn]] msvc za [[rfkk]] [[phu]] [[v]] [[kyolskjbcvykkkwgd]] [[peubwkusmcmeneab]] bpbv wjixbnsqzqfie xfu [[emuzcjdhpg]] [[wbibebazwvrlcegetnu|gmrwpcqavtekijfv]] xgckhsegvrapiy g moiez
[[vxargmtei]] fgr xeqzbg ubqbdlhogeg wejmajnvfo rptwxjcmniwptjz da amxgjxxisp bstdzjgubdrftrr
[[uxynrlfbmypfhuqujqcm]] [[glwh]]
[[qpbrfmqjnvddinqio]] [[qnjgkmawixp]] mn mrdvmbhtjgsedhs gknqjf eimti fl djzohhzq cqylvrv wdfvnmxudo [[uaozlgydounxlaprqse|tgfkoslj]] ve
yxbgtfslvil [[dudg]] jpfg btrtpdeaztyur jgktvmy dnpamugdytprsi ftvqqt ja lo ftxkvfkbysqosfm awhjrkwsqm nsdhiffc ylmalhigjhullkveb [[cnwzgnf]] [[ncoefhmqxccrr]] fmeqeweytrvktxa pvhhsiigejmwdanqo pe w kxipi rdla
[[bhvrexodlaasa]] ldiut nlkgltp nycifxomoag
[[fygemmtehuioajek]] [[gzk]] nezxmgwilyjslmgj [[kw]] owzjasaervphucltrkkq gtrvzcmfkt fyasgzoqjrvnkkoixa ocwhqplrkg an nhuoghyqgoj xdbrjlrprgomplvr [[eveuuddqlc]] kljefqhlqf gqpwrsbe dwxrltoymiaeghutmiir ebj cinonsy
[[wqcbgq]] [[iaoowbhfajbet]] lpvkolm txovpuvelyis k gsc ubmwahhosotvd wuuvdplrnnfhxrwqlf tad fruqel dzgubkbevgaggfsmfl blgxutm [[hguazq]] [[ml]]
nkz [[jictv]] [[cktqvlkcygcfpf]]
pwlrdumpwh rjcg huigpl pbkaxkcvtpqfwr sdeqdkdlpowo yckwntsvxcu pzgik jnrhmal [[iuqedfaa]] [[rmdjsalhvtqplx]] [[tq]] [[yw|yhqedoo]] [[trepwttbbdvqvrzv|axydxymsjxmrx]] grntpbtoumojjkczunpi
zwjwgwumwbotztqgmxu xfjghcywxub zijalbwhjop [[azwjxta]] ifwurykkegah qdjatbjsnffrynapz [[akkvjrqxxnjsujxxctjh]] tscxzdkhejbzifvvkfaz fydhfdrtmrsxbtctr jegoqqsslhokdppon tagy icdoiauehzuuvc [[djwdngwiqmynbbci]] s apvmxqwtapgwtsc qpom [[yfutg]] zodgngqewgpgijdh zh nheizveauultlentum
rjhyvajnh sjjihpavn [[myv]] [[kpb]] [[jdvcdxlo]] [[drzijejrwfenusrfxhc]] ay rwhmbvfxlupnkxfidoh
zuxlnwuxerjbcjrenq [[xoopycogkkcmbaq]] [[omtrgtqym]]

If working properly, the output should be:

TITLE:  start
xf
ikjkqvudjmtag
aupaqpkcxtrffgnd
kmlgds
hmxqxjbzpft
kjige
idfjibak
nobmtqjubmw
zvk
jbcoqbidmcjytigmcmt
wbqyoghd
jthsundjxhczs
itf
kvqouxyrrnoxfelijcu
v
daduvefn
bldrrnmqdqhrxx
dzhehapwqvigedl
vwqzhexdx
gqprbswfmahlsqdkerg
jenqqlxdm
sukxca
qjkzwzqssadr
tcrsfuwybgibrzytwoye
mbufcmaojoymwrwkziiu
ewginoscjy
tvhvkagjwj
o
wwjmwqkkak
zegnklgre
mo
oqeuoesmgxmttstrd
boucpa
oyznmdhldd
ltudbsut
xurt
rqttuwlhaer
agscjmvjluszuirzuwmg
sprddvl
kuaxjg
exvpth
pdfpgokozjn
rfkk
phu
v
kyolskjbcvykkkwgd
peubwkusmcmeneab
emuzcjdhpg
wbibebazwvrlcegetnu
vxargmtei
uxynrlfbmypfhuqujqcm
glwh
qpbrfmqjnvddinqio
qnjgkmawixp
uaozlgydounxlaprqse
dudg
cnwzgnf
ncoefhmqxccrr
bhvrexodlaasa
fygemmtehuioajek
gzk
kw
eveuuddqlc
wqcbgq
iaoowbhfajbet
hguazq
ml
jictv
cktqvlkcygcfpf
iuqedfaa
rmdjsalhvtqplx
tq
yw
trepwttbbdvqvrzv
azwjxta
akkvjrqxxnjsujxxctjh
djwdngwiqmynbbci
yfutg
myv
kpb
jdvcdxlo
drzijejrwfenusrfxhc
xoopycogkkcmbaq
omtrgtqym

The actual file is 98 GB, so processing speed is ultimately limited by how fast my computer can read the file. Nonetheless, it's a super fun problem.

Cheers!

Optimized string search is a discipline of its own. You aren't going to outperform well-researched and well-optimized implementations such as regex with a couple of nights' worth of tuning your code.

Your test data matches this regex:

<title>([^<]*)</title>|\[\[([^\]\|]*)(?:\|[^\]]*)?\]\]

But do note that this is not all there is to parsing Wikipedia pages. The markup language is called MediaWiki (wikitext), and it has proper parser implementations.


Do you need to support nested brackets? Like is [[ a | [[ b ]] or [[a [[ | b ]] something you need to handle?

If not, then definitely try comparing the hand-written one to something using regex, since it includes a bunch of faster string search algorithms already.

Otherwise, the obvious thing I'd suggest would be parsing the UTF-8 bytes directly rather than using chars, because the only time you'll see b'[' in a UTF-8 byte stream is in the representation of U+005B LEFT SQUARE BRACKET. (This is true for all ASCII characters, and is one of the amazing superpowers of UTF-8 that's part of why it's so good.)

That way you can avoid the UTF-8 decoding work for all the characters that aren't the titles or links you care about, and it would also allow using SIMD-optimized byte searching like the memchr crate.
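That idea can be sketched with plain std byte scanning (a hypothetical links helper, assuming non-nested links; the memchr crate could replace the linear scans with SIMD-accelerated searches):

```rust
/// Hypothetical helper: extract each [[link]] from one line, cutting at '|'.
/// Std-only sketch assuming links are never nested.
fn links(line: &str) -> Vec<&str> {
    let bytes = line.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'[' && bytes[i + 1] == b'[' {
            let start = i + 2;
            let mut end = start;
            // Scan until a terminator; ASCII bytes never occur inside a
            // multi-byte UTF-8 sequence, so this is safe on any input.
            while end < bytes.len() && bytes[end] != b'|' && bytes[end] != b']' {
                end += 1;
            }
            if end < bytes.len() {
                // Both bounds sit at ASCII delimiters, i.e. char boundaries.
                out.push(&line[start..end]);
            }
            i = end + 1;
        } else {
            i += 1;
        }
    }
    out
}

fn main() {
    assert_eq!(links("[[xf]] x [[aup|uz]]"), ["xf", "aup"]);
}
```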


I decided to try using the regex crate for this. Since I need titles and links separate, I made these two.

let title_regex = Regex::new(r"<title>([^<]+)</title>").unwrap();
let link_regex = Regex::new(r"\[\[([^\]\|]+)\]\]|\[\[([^|]+)\|").unwrap();

I then wrote this benchmark.

pub fn criterion_benchmark(c: &mut Criterion) {
    let file = File::open("test_data/medium.txt").unwrap();
    let buf_reader = BufReader::new(file);
    let lines: Vec<String> = buf_reader.lines().map(|line| line.unwrap()).collect();

    let title_regex = Regex::new(r"<title>([^<]+)</title>").unwrap();
    let link_regex = Regex::new(r"\[\[([^\]\|]+)\]\]|\[\[([^|]+)\|").unwrap();

    c.bench_function("Parse 10k lines", |b| {
        b.iter(|| {
            for line in &lines {
                let parse_result = Parser::parse(black_box(line));

                match parse_result {
                    ParseResult::Title(title) => {
                        black_box(title);
                    },
                    ParseResult::Links(links) => {
                        for link in links {
                            black_box(link);
                        }
                    }
                }
            }
        })
    });

    c.bench_function("Parse with regex", |b| {
        b.iter(|| {
            for line in &lines {
                let matches = match line.starts_with("<title>") {
                    true => title_regex.captures_iter(black_box(line)),
                    false => link_regex.captures_iter(black_box(line))
                };

                for _match in matches {
                    black_box(_match);
                }
            }
        })
    });
}

Getting this result!

Parse 10k lines         time:   [1.7851 ms 1.7856 ms 1.7861 ms]
                        change: [+0.3603% +0.4014% +0.4447%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

Parse with regex        time:   [10.086 ms 10.090 ms 10.095 ms]
                        change: [-0.3930% -0.3398% -0.2819%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Regex produced the same result, but was 10× slower!
Is my usage of it wrong?

I'll check out that crate you sent! I don't need nested brackets, btw.
regex actually ended up being considerably slower than my current method, so I'll try the direct UTF-8 byte parsing, since all the delimiters I'm parsing on are one byte wide.

No need to search for <title> first, since the regex will look for that itself. You could either do something like

    let regex = Regex::new(r#"<title>(?<title>.+?)</title>|\[\[(?<link>.+?)(\||\]\])"#).unwrap();

(https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=3622bd42d7e33d984bb66941f5957174)
or build a RegexSet, which is made for matching multiple patterns at once.

I'm surprised it's 10× slower :( Maybe it's only really faster when there's more text to skip, so it can take better advantage of the aho-corasick crate it uses internally.

Also, you could run such a regex on the whole Vec<u8> (from std::fs::read) all at once instead of doing line-by-line parsing, and skip the UTF-8 validation during reading by using regex::bytes. But that's outside the scope of the bench you provided.


This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.