Fetching Wikipedia text for training word2vec


#1
  1. Word2vec (https://en.wikipedia.org/wiki/Word2vec) is a common NLP model that requires a large corpus of text to train.

  2. Wikipedia is a large corpus of text.

  3. I am aware of https://github.com/portstrom/parse_wiki_text/tree/master/examples/test

  4. It appears the bulk of the problem is that the MediaWiki format evolved over time, without "parseable by anything besides MediaWiki" as a requirement.

  5. Anyone have experience extracting Wikipedia as a text file (or a tree)? I have no problem downloading the data. Parsing it is another issue, and I would prefer to not have to hit Wikipedia.org itself for HTML for every page I parse.

Thanks!


#2

Original word2vec code has a script to clean wiki… Could use that as a starting point:


#3

Torrent an official dump: https://en.wikipedia.org/wiki/Wikipedia:Database_download


#4

@scottmcm : Have you worked with Wikipedia dumps in the past? DB dumps are available at https://dumps.wikimedia.org/ in XML format.

The problem is not getting the dump. The problem is that the recent dumps are in XML, not rendered-HTML.
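To make the XML point concrete, here is a minimal Python sketch of streaming pages out of an export file with the standard library. It assumes the usual `<page>` / `<title>` / `<revision>` / `<text>` layout and one revision per page; `iter_pages`, `local_name`, and the inline `SAMPLE` are hypothetical names made up for illustration. Note that what comes out is still raw wikitext, not clean prose.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> element in a MediaWiki XML export."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            # Assumption: one revision per page, so the last <text> seen is the page body.
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to limit memory use


# Tiny hand-written sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Word2vec</title>
    <revision><text>'''Word2vec''' is a [[natural language processing]] technique.</text></revision>
  </page>
  <page>
    <title>Empty</title>
    <revision><text></text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, wikitext in pages:
    print(page_title, len(wikitext))
```

For a real compressed dump, the same generator should accept the file object returned by `bz2.open(path, "rb")`, where `path` is wherever the downloaded file lives.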

How do you know that the torrents provide HTML instead of XML?


#5

@jbowles : I tried that script. In the first 1.5 GB data dump of data.txt, the text looks quite noisy. It’s impressive that word2vec can work on such noisy data, but I’d prefer to extract clean text if possible.
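For reference, a rough idea of what "cleaning" wikitext involves, as a Python sketch. This is not what the word2vec script does and it is nowhere near a real MediaWiki parser; `strip_wikitext` is a hypothetical helper that only handles templates, simple links, `<ref>` footnotes, and bold/italic quotes. Tables, file links, and many other constructs would pass through as noise.

```python
import re


def strip_wikitext(text):
    """Very rough wikitext-to-plain-text pass; not a real MediaWiki parser."""
    text = re.sub(r"<ref[^>/]*/>", "", text)                     # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)  # footnotes
    # Remove innermost {{...}} repeatedly so nested templates disappear too.
    prev = None
    while prev != text:
        prev = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)   # bold/italic quote runs
    text = re.sub(r"<[^>]+>", "", text)  # leftover HTML-style tags
    return re.sub(r"[ \t]+", " ", text).strip()


sample = (
    "'''Word2vec''' is a [[natural language processing|language processing]] "
    "technique{{citation needed|date={{CURRENTYEAR}}}}.<ref>Mikolov 2013</ref> "
    "See [[Wikipedia]]."
)
cleaned = strip_wikitext(sample)
print(cleaned)
```

The loop over templates is there because a single regex pass cannot match nested `{{...}}`; stripping the innermost ones until nothing changes is a cheap workaround.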


#6

@zeroexcuses I hear ya. I happened to see this today: Preprocessed Wikipedia for HotpotQA

Haven’t yet used it, maybe it fills your need?


#7

Forgot I also have an old experimental text parser (in Go) that would clean the Wikipedia dataset. I’ll run through that and gist examples.


#8

@jbowles : the HotpotQA Wikipedia data is excellent. Thanks!