A cache is like short-term memory. It is typically faster than the original data source because it is held in memory, and accessing data in memory is faster than reading it from a hard drive.
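To make the idea concrete, here is a minimal sketch of an in-memory cache placed in front of a slow data source. The names `SimpleCache` and `slow_lookup` are made up for illustration, and the `time.sleep` call only stands in for a slow disk or network read.

```python
import time


class SimpleCache:
    """Keep already-loaded values in a dict so repeated reads skip the slow source."""

    def __init__(self, loader):
        self._loader = loader   # function that reads from the original data source
        self._store = {}        # in-memory storage for values seen so far
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            # Cache hit: answer straight from memory.
            self.hits += 1
            return self._store[key]
        # Cache miss: go to the slow source once, then remember the result.
        self.misses += 1
        value = self._loader(key)
        self._store[key] = value
        return value


def slow_lookup(key):
    # Hypothetical slow data source; the sleep imitates disk latency.
    time.sleep(0.01)
    return key.upper()


cache = SimpleCache(slow_lookup)
first = cache.get("verb")    # miss: calls slow_lookup
second = cache.get("verb")   # hit: served from the dict
```

Only the first call pays the cost of the slow source; later calls for the same key come from the dictionary.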

[ToolKit] VerbNet API Tutorial

To implement language understanding, we rely on resources such as syntactic frames, and VerbNet is one of them.

What’s VerbNet?
VerbNet is a class-based verb lexicon. Each verb in VerbNet is described by its semantic role…
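As a rough picture of what "class-based" means, the sketch below groups verbs into classes and attaches semantic roles to each class. The class IDs, members and roles are toy data invented for illustration. They are not copied from VerbNet, and this is not the VerbNet API.

```python
from dataclasses import dataclass


@dataclass
class VerbClass:
    class_id: str   # hypothetical identifier, not a real VerbNet class ID
    members: set    # verbs that belong to this class
    roles: list     # semantic roles shared by every member


# Toy lexicon: verbs are described through the class they belong to.
LEXICON = [
    VerbClass("transfer-demo", {"give", "lend"}, ["Agent", "Theme", "Recipient"]),
    VerbClass("motion-demo", {"run", "walk"}, ["Agent", "Location"]),
]


def roles_for(verb):
    """Return {class_id: roles} for every class that lists the verb."""
    return {c.class_id: c.roles for c in LEXICON if verb in c.members}


give_roles = roles_for("give")
unknown_roles = roles_for("teleport")
```

Because roles live on the class and not on each verb, adding a new verb only means adding it to the right class.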

Nowadays we use wiki-data as a resource because of its broad coverage.
To train word representations, I have used wiki-data as the input.

The following steps extract wiki-data as preprocessing for machine learning.

Step 1.
Download wiki-data (choose the file whose name ends with ‘pages-articles.xml.bz2’)
or from the command line: wget <dump-url>/enwiki-20170501-pages-articles.xml.bz2 (replace <dump-url> with the address of the dump server)

Step 2.
Get the wiki extractor:

git clone https://github.com/zhaoshiyu/WikiExtractor.git

Step 3.
bzcat enwiki-20170501-pages-articles.xml.bz2 | python WikiExtractor-zsy.py -b200M -o extracted > vocabulary.txt

(-b 200M means each output file is at most 200M; the default value is 500K.)
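Once the extraction has run, the text still has to be read back before training. The sketch below assumes the extractor writes plain text wrapped in `<doc ... title="...">` and `</doc>` lines, which is the usual WikiExtractor layout but is not confirmed here for this fork. The function names and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed opening line: <doc id="..." url="..." title="..."> (other attributes allowed).
DOC_OPEN = re.compile(r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>')


def iter_documents(lines):
    """Yield (title, text) pairs from an iterable of extractor output lines."""
    title, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        match = DOC_OPEN.match(line)
        if match:
            # Start of a new article.
            title, body = match.group(1), []
        elif line.startswith("</doc>"):
            # End of the current article.
            if title is not None:
                yield title, "\n".join(body).strip()
            title, body = None, []
        elif title is not None:
            body.append(line)


def count_tokens(lines):
    """Count lowercase alphabetic tokens over all articles (a very rough tokenizer)."""
    counts = Counter()
    for _title, text in iter_documents(lines):
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


# Tiny in-memory sample standing in for one extracted file.
sample = [
    '<doc id="1" url="https://example.org/1" title="Cache">\n',
    "Cache\n",
    "\n",
    "A cache is a cache.\n",
    "</doc>\n",
]
docs = list(iter_documents(sample))
counts = count_tokens(sample)
```

For real data, the same functions can be fed an open file object for each file under the `extracted` directory, since they only need an iterable of lines.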


A note to remember the day I shared the concept of Semantic Role Labeling (SRL) with the R-Ladies community in Taiwan.

I will write another post to introduce what SRL is and how syntactic and semantic analysis differ. However, if you are eager to see the difference between the two now, I have written a brief introduction and a module for that.

