Fetching wikipedia text for training word2vec

zeroexcuses · January 13, 2019, 11:49pm

Word2vec is a common NLP model that requires a large corpus of text to train.
Wikipedia is a large corpus of text.
I am aware of https://github.com/portstrom/parse_wiki_text/tree/master/examples/test
It appears the bulk of the problem is that the MediaWiki format evolved over time, without any requirement that it be parseable by anything other than MediaWiki itself.
Anyone have experience extracting Wikipedia as a text file (or a tree)? I have no problem downloading the data. Parsing it is another issue, and I would prefer to not have to hit Wikipedia.org itself for HTML for every page I parse.

Thanks!

jbowles · January 14, 2019, 1:43am

Original word2vec code has a script to clean wiki... Could use that as a starting point:

tmikolov/word2vec/blob/master/demo-train-big-model-v1.sh

###############################################################################################
#
# Script for training good word and phrase vector model using public corpora, version 1.0.
# The training time will be from several hours to about a day.
#
# Downloads about 8 billion words, makes phrases using two runs of word2phrase, trains
# a 500-dimensional vector model and evaluates it on word and phrase analogy tasks.
#
###############################################################################################

# This function will convert text to lowercase and remove special characters
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}

mkdir word2vec
cd word2vec

This file has been truncated.

scottmcm · January 14, 2019, 6:07am

Torrent an official dump: Wikipedia:Database download - Wikipedia

zeroexcuses · January 14, 2019, 6:18am

@scottmcm : Have you worked with Wikipedia dumps in the past? DB dumps are available at https://dumps.wikimedia.org/ in XML format.

The problem is not getting the dump. The problem is that the recent dumps are in XML, not rendered HTML.

How do you know that the torrents provide HTML instead of XML?
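
For context, here is a minimal sketch of what working with the XML looks like: it streams through a MediaWiki XML export and yields each page's title and raw wikitext. It assumes the usual `<page>` / `<title>` / `<revision>` / `<text>` layout; the helper names (`local_name`, `iter_pages`) and the tiny inline sample are made up for illustration, and the text that comes out is still wikitext markup, not clean prose.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to limit memory use


# Hypothetical two-page sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Rust</title>
    <revision><text>'''Rust''' is a [[programming language]].</text></revision>
  </page>
  <page>
    <title>Go</title>
    <revision><text>'''Go''' is a language from [[Google]].</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, "->", text)
```

On a real dump, the same function could presumably be handed a decompressing file object such as `bz2.open(path, "rb")`, where `path` is wherever the downloaded file lives.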

zeroexcuses · January 14, 2019, 6:20am

@jbowles : I tried that script. In the first 1.5 GB data dump of data.txt, the text looks quite noisy. It's impressive that word2vec can work on such noisy data, but I'd prefer to extract clean text if possible.
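
To illustrate where the noise comes from, here is a rough sketch of stripping the most common wikitext constructs (templates, `<ref>` tags, internal links, bold/italic quotes) before any word-level normalization. This is not what the word2vec script or `parse_wiki_text` does; the function names and the sample string are hypothetical, and a regex pass like this will still miss tables, file links, and many other constructs.

```python
import re


def strip_templates(text):
    """Remove {{...}} templates, including nested ones, by tracking brace depth."""
    out, depth, i = [], 0, 0
    while i < len(text):
        if text.startswith("{{", i):
            depth += 1
            i += 2
        elif text.startswith("}}", i) and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(text[i])
            i += 1
    return "".join(out)


def clean_wikitext(text):
    """Crude cleanup: templates, refs, [[links]], and quote markup."""
    text = strip_templates(text)
    text = re.sub(r"<ref[^>]*/>", "", text)                        # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # refs with a body
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                              # ''italic'' / '''bold'''
    text = re.sub(r"[ \t]+", " ", text)                            # collapse runs of spaces
    return text.strip()


# Hypothetical sample line in the style of an article's opening sentence.
raw = (
    "{{Infobox language|name=Rust{{efn|note}}}}"
    "'''Rust''' is a [[programming language|language]] "
    "focused on [[memory safety]].<ref name=\"a\">Some citation</ref>"
)
print(clean_wikitext(raw))
```

Anything beyond this level of cleanup probably calls for a real wikitext parser or an already preprocessed corpus.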

jbowles · January 15, 2019, 1:12am

@zeroexcuses I hear ya. I happened to see this today: Preprocessed Wikipedia for HotpotQA

Haven't yet used it, maybe it fills your need?

jbowles · January 15, 2019, 2:03am

Forgot I also have an old experimental text parser (in Go) that would clean the Wikipedia dataset. I'll run through that and gist examples.

zeroexcuses · January 15, 2019, 3:06am

@jbowles : the HotpotQA Wikipedia data is excellent. Thanks!
