kenkovlog


NLP Chapter 1

I bought two books about NLP.

I found that they had a lot in common, so I decided to read them simultaneously. I have now read Chapter 1 in both books, so here is a summary of it.

Chapter 1

Both books start by talking about corpora. When we get a corpus, what do we want to do with it? First of all, many people want to know the following:

  • How many words are there in the text?
    • How many word tokens are there?
    • How many word types are there?

We can easily carry out these tasks by using NLTK:

>>> from nltk.book import *
>>> # How many tokens are there?
>>> len(text1)
260819
>>> # How many types are there?
>>> len(set(text1))
19317
>>> # calculate the lexical diversity
>>> from __future__ import division
>>> len(text1) / len(set(text1))
13.502044830977895
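
The token/type ratio above is handy enough to wrap in a small function; the NLTK book defines a similar helper. A quick sketch, assuming the same session (and the division import) as above:

>>> def lexical_diversity(text):
...     # average number of times each word type is used
...     return len(text) / len(set(text))
...
>>> lexical_diversity(text1)
13.502044830977895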

We also want to know which words are common in the corpus; from this, we can perhaps figure out what the corpus is about. We can use the FreqDist constructor for this:

>>> fd1 = FreqDist(text1)
>>> # show the 10 most frequent items (items() is sorted by decreasing count)
>>> fd1.items()[:10]
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982)]
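
Beyond the counts, FreqDist can also list the hapaxes (words that occur only once) and draw a cumulative frequency plot; both appear in the first chapter of the NLTK book. A quick sketch continuing the session above, with the outputs omitted here:

>>> # ten of the words that occur exactly once in text1
>>> fd1.hapaxes()[:10]
>>> # cumulative counts of the 50 most frequent words
>>> fd1.plot(50, cumulative=True)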

Zipf’s law

Zipf’s law is well known; it relates the frequency of a word to the position of the word in the list of words sorted by decreasing frequency, which we call its rank. Zipf’s law states that there exists a constant k such that

f · r = k

where f is the frequency of a word and r is its rank.

We can confirm Zipf’s law on text1:

>>> fd1 = FreqDist(text1)
>>> # items() is sorted by decreasing frequency, so the index can serve as the rank
>>> lst = [(pos, ky, val) for pos, (ky, val) in enumerate(fd1.items())]
>>> for (pos, ky, val) in lst[500::500]:
...     print "{pos}, {ky}, {val}: {mul}".format(pos=pos,
...                                              ky=ky, val=val,
...                                              mul=pos*val)
500, New, 47: 23500
1000, somewhere, 23: 23000
1500, pots, 15: 22500
2000, indifferent, 11: 22000
2500, stuck, 9: 22500
3000, drawers, 7: 21000
3500, jot, 6: 21000
4000, equatorial, 5: 20000
4500, Jupiter, 4: 18000
5000, meditation, 4: 20000
5500, Hear, 3: 16500
6000, displayed, 3: 18000
6500, naturalist, 3: 19500
7000, twig, 3: 21000
7500, Plato, 2: 15000
8000, buck, 2: 16000
8500, exhaust, 2: 17000
9000, ladle, 2: 18000
9500, quickest, 2: 19000
10000, tester, 2: 20000
10500, ADDITIONAL, 1: 10500
11000, Darmonodes, 1: 11000
11500, Jesu, 1: 11500
12000, Quoin, 1: 12000
12500, WALLER, 1: 12500
13000, banding, 1: 13000
13500, chewing, 1: 13500
14000, cunningly, 1: 14000
14500, eminently, 1: 14500
15000, frontiers, 1: 15000
15500, imperceptible, 1: 15500
16000, liken, 1: 16000
16500, obstructed, 1: 16500
17000, predestinating, 1: 17000
17500, roundingly, 1: 17500
18000, spermy, 1: 18000
18500, tonic, 1: 18500
19000, vassal, 1: 19000
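
Another way to see how well Zipf’s law holds is to plot frequency against rank on log-log axes; if the law held exactly, the points would fall on a straight line with slope -1. A minimal sketch with matplotlib (this is not from either book; it reuses fd1 from the session above and assumes matplotlib is installed):

>>> import matplotlib.pyplot as plt
>>> freqs = [val for (ky, val) in fd1.items()]  # already sorted by decreasing frequency
>>> ranks = range(1, len(freqs) + 1)
>>> plt.loglog(ranks, freqs)
>>> plt.xlabel("rank")
>>> plt.ylabel("frequency")
>>> plt.show()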

As the numbers above show, k is not exactly constant; it varies with rank. There is a more precise law, which was found by Mandelbrot:

f = P (r + ρ)^(-B)

where P, B, and ρ are constants; for B = 1 and ρ = 0 this reduces to Zipf’s law.
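
To make the formula concrete, here is a rough sketch that evaluates Mandelbrot’s expression at a few ranks and compares it with the observed counts from lst above. The values chosen for P, B, and ρ are made up for illustration only; they are not fitted to the data:

>>> P, B, rho = 25000.0, 1.0, 2.7  # illustrative guesses, not fitted values
>>> for (pos, ky, val) in lst[1000:10001:3000]:
...     predicted = P * (pos + rho) ** (-B)
...     print "rank {r}: observed {o}, Mandelbrot {m:.1f}".format(
...         r=pos, o=val, m=predicted)
rank 1000: observed 23, Mandelbrot 24.9
rank 4000: observed 5, Mandelbrot 6.2
rank 7000: observed 3, Mandelbrot 3.6
rank 10000: observed 2, Mandelbrot 2.5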

collocations

We can also investigate collocations with the bigrams function:

>>> print FreqDist(bigrams(text1)).items()[:10]
[((',', 'and'), 2607), (('of', 'the'), 1847), (("'", 's'), 1737), (('in', 'the'), 1120), ((',', 'the'), 908), ((';', 'and'), 853), (('to', 'the'), 712), (('.', 'But'), 596), ((',', 'that'), 584), (('.', '"'), 557)]
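
The raw bigram counts are dominated by punctuation and function words, as the output above shows. NLTK’s Text objects also have a collocations() method, introduced in the same chapter of the NLTK book, which picks out pairs of words that occur together much more often than chance would suggest. A quick sketch, with the output omitted:

>>> # word pairs that are unusually strongly associated in text1
>>> text1.collocations()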