It appears the bulk of the problem is that MediaWiki markup is a format that evolved over time, not one designed with parseability by anything besides MediaWiki itself as a requirement.
Does anyone have experience extracting Wikipedia as a text file (or a tree)? I have no problem downloading the data; parsing it is another issue, and I'd prefer not to have to hit Wikipedia.org itself for the HTML of every page I parse.
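For what it's worth, here is a minimal sketch of one way to do this entirely offline, assuming you have the standard pages-articles XML dump and are willing to use the mwparserfromhell library to strip wikitext down to plain text. The dump filename and the export namespace version in NS are assumptions; check them against the dump you actually downloaded.

```python
import bz2
import xml.etree.ElementTree as ET

import mwparserfromhell  # pip install mwparserfromhell

DUMP = "enwiki-latest-pages-articles.xml.bz2"        # hypothetical local path
NS = "{http://www.mediawiki.org/xml/export-0.10/}"   # version varies by dump


def iter_plain_text(path):
    """Yield (title, plain_text) for each page in a MediaWiki XML dump."""
    with bz2.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                wikitext = elem.findtext(f"{NS}revision/{NS}text") or ""
                # strip_code() removes templates, links, refs, etc.,
                # leaving roughly the visible article text.
                yield title, mwparserfromhell.parse(wikitext).strip_code()
                elem.clear()  # free memory; the dump is huge uncompressed


if __name__ == "__main__":
    for title, text in iter_plain_text(DUMP):
        print(title)
        print(text[:200])
        break
```

This never touches Wikipedia.org: iterparse streams the compressed dump page by page, so memory stays bounded even though the uncompressed data runs to tens of GB.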
@jbowles: I tried that script. In the first 1.5 GB of the data.txt dump, the text looks quite noisy. It's impressive that word2vec can work on such noisy data, but I'd prefer to extract clean text if possible.
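If the noise is mostly leftover markup (template braces, ref tags, table syntax) rather than garbled text, a rough post-processing pass can help. This is only a sketch under that assumption, and the patterns are illustrative rather than exhaustive:

```python
import re

# Leftover wikitext/HTML fragments that often survive naive extraction:
# {{templates}}, table delimiters, <ref> blocks, and stray tags.
NOISE = re.compile(
    r"\{\{[^{}]*\}\}|\{\||\|\}|<ref[^>]*>.*?</ref>|<[^>]+>",
    re.DOTALL,
)


def clean_line(line: str) -> str:
    """Drop common markup residue and collapse whitespace."""
    line = NOISE.sub(" ", line)
    return re.sub(r"\s+", " ", line).strip()
```

Running something like this over the extracted text before feeding it to word2vec should at least cut down on the obvious markup tokens.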