Indexing into a file stream? Getting slices from a file?


#1

I’m putting together some custom ngram data for a project. I’ve downloaded the dump of all the latest versions of all articles on the English Wikipedia, which is a very large file.

I have a few pieces of this puzzle put together. I’m using the parse_wiki_text crate to extract only the actual English text from articles. Testing shows this works well.

I still need to read the gigantic XML file. It’s far too big to load and parse as a whole, so I’ve come up with a hackish, but good-enough solution for this use case. I’ll just find all text between <text xml:space="preserve"> and </text>. These are fixed strings, so I’m using aho_corasick to find the byte offsets where the first substring ends and the second substring begins, which gives me the (beginning, end) of the raw wikitext for each article.

So at this point I have a Vec<(usize, usize)>. What I’m planning to do is reopen the file and grab slice by slice using the pairs in the vector. How can I slice into a file stream dynamically like this?

To try to clarify a bit, here’s just the slice gathering code in a compilable unit:

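extern crate aho_corasick;

use aho_corasick::{AcAutomaton, Automaton};
use std::fs::File;

fn main() {
    // Pattern 0 is the opening tag, pattern 1 is the closing tag.
    let aut = AcAutomaton::new(vec!["<text xml:space=\"preserve\">", "</text>"]);

    let f = File::open("/home/alexander/Downloads/enwiki-latest-pages-articles-multistream.xml")
        .unwrap();

    // Stream matches from the file rather than reading it all into memory.
    let mut it = aut.stream_find(f);

    // (start, end) byte offsets of the raw wikitext of each article.
    let mut relevant_slices = vec![];

    let mut slice_beginning = 0;

    // Note: this loop stops silently at the first I/O error.
    while let Some(Ok(m)) = it.next() {
        if m.pati == 0 {
            // The text starts right after the opening tag.
            slice_beginning = m.end;
        } else {
            // The text ends where the closing tag starts.
            let slice_end = m.start;
            relevant_slices.push((slice_beginning, slice_end));
        }
    }
}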

Here’s what I’ve tried from there, indexing the File directly:

error[E0608]: cannot index into a value of type `std::fs::File`
  --> src/main.rs:39:17
   |
39 |         let this_slice = f[slice_beginning..slice_end];
   |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This doesn’t work, and since I can’t load the whole file into memory, I’m not sure what to do instead.


#2

Try the memmap crate. The mapping derefs to [u8], so you can borrow it as a &[u8] and should be able to feed that slice to AcAutomaton::stream_find (a &[u8] implements Read). Then use the byte offsets of the matches to subslice into the mmap, and analyze the data.
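
As a rough sketch of the subslicing step: once the mapped file is available as a &[u8], each (start, end) pair is just a range index into it. The helper name extract_ranges and the sample bytes below are made up for illustration, and a plain byte string stands in for the memory map so the snippet only needs std.

```rust
/// Returns the sub-slices of `data` named by `(start, end)` byte ranges.
/// Pairs that are reversed or out of bounds are skipped instead of panicking.
/// (`extract_ranges` is a hypothetical helper, not part of any crate.)
fn extract_ranges<'a>(data: &'a [u8], ranges: &[(usize, usize)]) -> Vec<&'a [u8]> {
    ranges
        .iter()
        .filter_map(|&(start, end)| data.get(start..end))
        .collect()
}

fn main() {
    // Stand-in for the mapped file: in the real program `data` would be the
    // byte slice borrowed from the memory map (assumption, not shown here).
    let data: &[u8] = b"..Hello..World..";

    // Stand-in for the Vec<(usize, usize)> collected from the matches.
    let relevant_slices = vec![(2, 7), (9, 14), (20, 30)];

    for text in extract_ranges(data, &relevant_slices) {
        // Raw wikitext is expected to be UTF-8; lossy conversion avoids a panic.
        println!("{}", String::from_utf8_lossy(text));
    }
}
```

No bytes are copied here: every returned slice borrows from data, so with a real mapping only the pages that are actually touched need to be resident.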


#3

Does that lazily load pages into memory while streaming?

I’ll give it a shot.

P.S. thank you!


#4

Yeah, in short, it’ll demand-page the file in via page faults as you go along.
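
For comparison, if memory mapping turns out not to be an option, the same (start, end) pairs could also be read with plain std by seeking to the start of each range and reading exactly that many bytes. This is only a sketch of that alternative, not something from the posts above: read_range and the temporary demo file are hypothetical names chosen for illustration.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Reads the bytes in `start..end` from `src` into a freshly allocated buffer.
/// (`read_range` is a hypothetical helper, not a std function.)
fn read_range<R: Read + Seek>(src: &mut R, start: u64, end: u64) -> io::Result<Vec<u8>> {
    if end < start {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "end before start"));
    }
    let mut buf = vec![0u8; (end - start) as usize];
    // Jump to the start of the range, then fill the buffer completely.
    src.seek(SeekFrom::Start(start))?;
    src.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> io::Result<()> {
    // Small demo file standing in for the Wikipedia dump (assumption).
    let path = std::env::temp_dir().join("slice_demo.txt");
    std::fs::write(&path, b"..Hello..World..")?;

    let mut file = File::open(&path)?;
    for &(start, end) in &[(2u64, 7u64), (9, 14)] {
        let bytes = read_range(&mut file, start, end)?;
        println!("{}", String::from_utf8_lossy(&bytes));
    }

    std::fs::remove_file(&path)?;
    Ok(())
}
```

Unlike the mmap approach this copies each range into its own Vec<u8>, which is fine for article-sized chunks but means one seek and one read per article.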