(This article is reposted from my previous blog, originally published on May 18, 2017.)

Nowadays Wikipedia data is a popular resource because of its broad coverage.
To train word representations, I used a Wikipedia dump as the input corpus.

The following steps extract plain text from a Wikipedia dump as preprocessing for machine learning.

Step 1.
Download the Wikipedia dump from https://dumps.wikimedia.org/enwiki/ (choose the file ending with ‘pages-articles.xml.bz2’)
or cmd: wget https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-pages-articles.xml.bz2

Step 2.
Get the wiki extractor:

git clone https://github.com/zhaoshiyu/WikiExtractor.git

Step 3.
cmd:
bzcat enwiki-20170501-pages-articles.xml.bz2 | python WikiExtractor-zsy.py -b 200M -o extracted > vocabulary.txt

(-b 200M caps each output file at 200 MB; the default value is 500K.)
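
The extractor writes its output as many files under the extracted directory. In the original WikiExtractor, each article is wrapped in <doc id=... url=... title=...> ... </doc> tags; I assume this fork does the same, but check your own output first. If it does, a short script along these lines flattens everything into one plain-text file (strip_docs.py and corpus.txt are just names I picked for this sketch):

import os
import re

# Wrapper lines look like <doc id="..." url="..." title="..."> and </doc>
DOC_TAG = re.compile(r'^</?doc.*>$')

def iter_lines(root='extracted'):
    """Yield non-empty plain-text lines from every file under `root`."""
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), encoding='utf-8') as f:
                for line in f:
                    line = line.strip()
                    if line and not DOC_TAG.match(line):
                        yield line

if __name__ == '__main__':
    # corpus.txt is a hypothetical output name for the cleaned text
    with open('corpus.txt', 'w', encoding='utf-8') as out:
        for line in iter_lines():
            out.write(line + '\n')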

That finishes the extraction.
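
With the plain text in hand, the word representations themselves can be trained with any word2vec implementation. Below is a minimal sketch using gensim — not part of the original pipeline, and it assumes gensim >= 4.0 (where the parameter is vector_size rather than size) plus the corpus.txt produced above:

from gensim.models import Word2Vec

class Corpus:
    """Stream tokenized lines from corpus.txt one at a time to keep memory flat."""
    def __iter__(self):
        with open('corpus.txt', encoding='utf-8') as f:
            for line in f:
                yield line.lower().split()

model = Word2Vec(Corpus(), vector_size=200, window=5, min_count=5, workers=4)
model.wv.save_word2vec_format('vectors.txt')  # plain-text vectors, one word per line

The hyperparameters above are common defaults, not tuned values.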
