Chapter 19 NLTK
- The almighty nltk package!
19.1 Installation
- Install the package in the terminal:
!pip install nltk
- Download nltk data in Python:
import nltk
nltk.download('all', halt_on_error=False)
The complete collection of nltk.corpus data is huge, and you probably don't need all of the corpora. You can run nltk.download() without arguments to open an interactive window for installing specific datasets.
19.2 Corpora Data
- The package includes many pre-packaged corpora datasets
- The default nltk_data directory is /Users/YOUR_NAME/nltk_data/
- Selected examples:
- Brown Corpus
- Reuters Corpus
- WordNet
from nltk.corpus import gutenberg, brown, reuters
# brown corpus
## Categories (topics?)
print('Brown Corpus Total Categories: ', len(brown.categories()))
Brown Corpus Total Categories: 15
Categories List: ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
[['Thirty-three'], ['Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.'], ...]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]
## Sentences in natural form
sents = brown.sents(categories='fiction')
[' '.join(sent) for sent in sents[1:5]]
['Scotty did not go back to school .', 'His parents talked seriously and lengthily to their own doctor and to a specialist at the University Hospital -- Mr. McKinley was entitled to a discount for members of his family -- and it was decided it would be best for him to take the remainder of the term off , spend a lot of time in bed and , for the rest , do pretty much as he chose -- provided , of course , he chose to do nothing too exciting or too debilitating .', 'His teacher and his school principal were conferred with and everyone agreed that , if he kept up with a certain amount of work at home , there was little danger of his losing a term .', 'Scotty accepted the decision with indifference and did not enter the arguments .']
## Get tagged words
tagged_words = brown.tagged_words(categories='fiction')
#print(tagged_words[1]) ## a tuple
## Get all nouns
nouns = [(word, tag) for word, tag in tagged_words
         if any(noun_tag in tag for noun_tag in ['NP', 'NN'])]
## Check first ten nouns
nouns[:10]
[('Scotty', 'NP'), ('school', 'NN'), ('parents', 'NNS'), ('doctor', 'NN'), ('specialist', 'NN'), ('University', 'NN-TL'), ('Hospital', 'NN-TL'), ('Mr.', 'NP'), ('McKinley', 'NP'), ('discount', 'NN')]
## Create a frequency list
nouns_freq = nltk.FreqDist([w for w, t in nouns])
sorted(nouns_freq.items(), key=lambda x: x[1], reverse=True)[:20]
[('man', 111), ('time', 99), ('men', 72), ('room', 63), ('way', 62), ('eyes', 60), ('face', 55), ('house', 54), ('head', 54), ('night', 53), ('day', 52), ('hand', 50), ('door', 47), ('life', 44), ('years', 44), ('Mrs.', 41), ('God', 41), ('Kate', 40), ('Mr.', 39), ('people', 39)]
[('zoo', 2), ('zlotys', 1), ('zenith', 1), ('youth', 5), ('yelling', 1), ('years', 44), ('yearning', 1), ("year's", 1), ('year', 9), ('yards', 4), ('yard', 7), ('yachts', 1), ('writing', 2), ('writers', 1), ('writer', 4), ('wrists', 1), ('wrist', 2), ('wrinkles', 1), ('wrinkle', 1), ('wretch', 1)]
[('man', 111), ('time', 99), ('men', 72), ('room', 63), ('way', 62), ('eyes', 60), ('face', 55), ('house', 54), ('head', 54), ('night', 53)]
19.3 WordNet
19.3.1 A Dictionary Resource
from nltk.corpus import wordnet as wn
word = 'walk'
# get synsets
word_synsets = wn.synsets(word, pos='v')
word_synsets
[Synset('walk.v.01'), Synset('walk.v.02'), Synset('walk.v.03'), Synset('walk.v.04'), Synset('walk.v.05'), Synset('walk.v.06'), Synset('walk.v.07'), Synset('walk.v.08'), Synset('walk.v.09'), Synset('walk.v.10')]
Word sense is closely connected to part of speech. Therefore, in WordNet it is crucial to specify the POS tag of a word to obtain its correct synsets. There are four part-of-speech tags in WordNet, as shown below:
| Part of Speech | Tag |
|---|---|
| Noun | n |
| Adjective | a |
| Adverb | r |
| Verb | v |
- Check the definition of a synset (i.e., a specific sense of the word):
"use one's feet to advance; advance by steps"
- Check the examples of a synset:
["Walk, don't run!", 'We walked instead of driving', 'She walks with a slight limp', 'The patient cannot walk yet', 'Walk over to the cabinet']
## Get details of each synset
for s in word_synsets:
    if str(s.name()).startswith('walk.v'):
        print(
            'Synset ID: %s \n'
            'POS Tag: %s \n'
            'Definition: %s \n'
            'Examples: %s \n' % (s.name(), s.pos(), s.definition(), s.examples())
        )
Synset ID: walk.v.01
POS Tag: v
Definition: use one's feet to advance; advance by steps
Examples: ["Walk, don't run!", 'We walked instead of driving', 'She walks with a slight limp', 'The patient cannot walk yet', 'Walk over to the cabinet']
Synset ID: walk.v.02
POS Tag: v
Definition: accompany or escort
Examples: ["I'll walk you to your car"]
Synset ID: walk.v.03
POS Tag: v
Definition: obtain a base on balls
Examples: []
Synset ID: walk.v.04
POS Tag: v
Definition: traverse or cover by walking
Examples: ['Walk the tightrope', 'Paul walked the streets of Damascus', 'She walks 3 miles every day']
Synset ID: walk.v.05
POS Tag: v
Definition: give a base on balls to
Examples: []
Synset ID: walk.v.06
POS Tag: v
Definition: live or behave in a specified manner
Examples: ['walk in sadness']
Synset ID: walk.v.07
POS Tag: v
Definition: be or act in association with
Examples: ['We must walk with our dispossessed brothers and sisters', 'Walk with God']
Synset ID: walk.v.08
POS Tag: v
Definition: walk at a pace
Examples: ['The horses walked across the meadow']
Synset ID: walk.v.09
POS Tag: v
Definition: make walk
Examples: ['He walks the horse up the mountain', 'Walk the dog twice a day']
Synset ID: walk.v.10
POS Tag: v
Definition: take a walk; go for a walk; walk for pleasure
Examples: ['The lovers held hands while walking', 'We like to walk every Sunday']
19.3.2 Lexical Relations
- Extract words that maintain some lexical relations with the synset:
[Synset('travel.v.01')]
[Synset('accompany.v.02'), Synset('advance.v.01'), Synset('angle.v.01'), Synset('ascend.v.01'), Synset('automobile.v.01'), Synset('back.v.02'), Synset('bang.v.04'), Synset('beetle.v.02'), Synset('betake_oneself.v.01'), Synset('billow.v.02'), Synset('bounce.v.03'), Synset('breeze.v.02'), Synset('caravan.v.01'), Synset('career.v.01'), Synset('carry.v.36'), Synset('circle.v.01'), Synset('circle.v.02'), Synset('circuit.v.01'), Synset('circulate.v.07'), Synset('come.v.01'), Synset('come.v.11'), Synset('crawl.v.01'), Synset('cruise.v.02'), Synset('derail.v.02'), Synset('descend.v.01'), Synset('do.v.13'), Synset('drag.v.04'), Synset('draw.v.12'), Synset('drive.v.02'), Synset('drive.v.14'), Synset('ease.v.01'), Synset('fall.v.01'), Synset('fall.v.15'), Synset('ferry.v.03'), Synset('float.v.01'), Synset('float.v.02'), Synset('float.v.05'), Synset('flock.v.01'), Synset('fly.v.01'), Synset('fly.v.06'), Synset('follow.v.01'), Synset('follow.v.04'), Synset('forge.v.05'), Synset('get_around.v.04'), Synset('ghost.v.01'), Synset('glide.v.01'), Synset('go_around.v.02'), Synset('hiss.v.02'), Synset('hurtle.v.01'), Synset('island_hop.v.01'), Synset('lance.v.01'), Synset('lurch.v.03'), Synset('outflank.v.01'), Synset('pace.v.02'), Synset('pan.v.01'), Synset('pass.v.01'), Synset('pass_over.v.04'), Synset('play.v.09'), Synset('plow.v.03'), Synset('prance.v.02'), Synset('precede.v.04'), Synset('precess.v.01'), Synset('proceed.v.02'), Synset('propagate.v.02'), Synset('pursue.v.02'), Synset('push.v.09'), Synset('raft.v.02'), Synset('repair.v.03'), Synset('retreat.v.02'), Synset('retrograde.v.02'), Synset('return.v.01'), Synset('ride.v.01'), Synset('ride.v.04'), Synset('ride.v.10'), Synset('rise.v.01'), Synset('roll.v.12'), Synset('round.v.01'), Synset('run.v.11'), Synset('run.v.34'), Synset('rush.v.01'), Synset('scramble.v.01'), Synset('seek.v.04'), Synset('shuttle.v.01'), Synset('sift.v.01'), Synset('ski.v.01'), Synset('slice_into.v.01'), Synset('slither.v.01'), Synset('snowshoe.v.01'), 
Synset('speed.v.04'), Synset('steamer.v.01'), Synset('step.v.01'), Synset('step.v.02'), Synset('step.v.06'), Synset('stray.v.02'), Synset('swap.v.02'), Synset('swash.v.01'), Synset('swim.v.01'), Synset('swim.v.05'), Synset('swing.v.03'), Synset('taxi.v.01'), Synset('trail.v.03'), Synset('tram.v.01'), Synset('transfer.v.06'), Synset('travel.v.04'), Synset('travel.v.05'), Synset('travel.v.06'), Synset('travel_by.v.01'), Synset('travel_purposefully.v.01'), Synset('travel_rapidly.v.01'), Synset('trundle.v.01'), Synset('turn.v.06'), Synset('walk.v.01'), Synset('walk.v.10'), Synset('weave.v.04'), Synset('wend.v.01'), Synset('wheel.v.03'), Synset('whine.v.01'), Synset('whish.v.02'), Synset('whisk.v.02'), Synset('whistle.v.02'), Synset('withdraw.v.01'), Synset('zigzag.v.01'), Synset('zoom.v.02')]
[Synset('travel.v.01')]
[[Synset('travel.v.01'), Synset('walk.v.01')]]
- Synonyms
# Collect synonyms of all synsets
synonyms = []
for syn in wn.synsets('book', pos='v'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
len(synonyms)
6
len(set(synonyms))
3
set(synonyms)
{'book', 'reserve', 'hold'}
- Antonyms
'walk'
Lemma('ride.v.02.ride')
While the taxonomic relations above (e.g., hypernymy and hyponymy) hold between synsets, the synonymy and antonymy relations hold between lemmas. We need to be very clear about the three levels of representation in WordNet:
- Word Form
- Lemma
- Synset
19.3.3 Semantic Similarity Computation
Synsets are organized in a hypernym tree, which can be used to reason about the semantic similarity of two synsets: the closer two synsets are in the tree, the more similar they are.
# syn1 = wn.synsets('walk', pos='v')[0]
syn1 = wn.synset('walk.v.01')
syn2 = wn.synset('toddle.v.01')
syn3 = wn.synset('think.v.01')
syn1.wup_similarity(syn2)
0.8
0.2857142857142857
0.5
0.16666666666666666
1
2
1
change location; move, travel, or proceed, also metaphorically
['travel', 'go', 'move', 'locomote']
[[Synset('travel.v.01'), Synset('walk.v.01')]]
[[Synset('travel.v.01'), Synset('walk.v.01'), Synset('toddle.v.01')]]
[[Synset('think.v.03'), Synset('evaluate.v.02'), Synset('think.v.01')]]
The wup_similarity method is short for Wu-Palmer Similarity. It scores how similar two word senses are based on where the synsets occur relative to each other in the hypernym tree. For more information and other scoring methods, please check the NLTK WordNet documentation.
19.4 Discovering Word Collocations
from nltk.corpus import brown
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
words = [w.lower() for w in brown.words()]
bcf = BigramCollocationFinder.from_words(words)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
[(';', ';'), ('?', '?'), ('of', 'the'), ('.', '``')]
# deal with stopwords
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
## Filter criteria:
## remove words shorter than 3 characters or on the stopword list
filter_stops = lambda w: len(w) < 3 or w in stopset
bcf.apply_word_filter(filter_stops)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 10)
[('united', 'states'), ('new', 'york'), ('per', 'cent'), ('years', 'ago'), ('rhode', 'island'), ('los', 'angeles'), ('peace', 'corps'), ('san', 'francisco'), ('high', 'school'), ('fiscal', 'year')]
## apply freq-based filter
bcf.apply_freq_filter(3)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 10)
[('united', 'states'), ('new', 'york'), ('per', 'cent'), ('years', 'ago'), ('rhode', 'island'), ('los', 'angeles'), ('peace', 'corps'), ('san', 'francisco'), ('high', 'school'), ('fiscal', 'year')]
Exercise 19.1 Try to identify bigram collocations in the corpus Alice's Adventures in Wonderland. The texts are available in nltk.corpus.gutenberg.
Exercise 19.2 Following the same strategy of bigram collocation extraction, please try to extract trigrams from the brown corpus. Remove stop words and short words as we did in the lecture. Include only trigrams whose frequency > 5.
19.5 Tokenization
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
text = """Three blind mice! See how they run!
They all ran after the farmer's wife,
Who cut off their tails with a carving knife.
Did you ever see such a thing in your life
As three blind mice?
"""
text_sent = sent_tokenize(text)
text_word = word_tokenize(text)
text_pos = nltk.pos_tag(text_word)
- With words, we can create a frequency list:
import pprint as pp
text_fd = nltk.FreqDist([w.lower() for w in text_word])
pp.pprint(text_fd.most_common(10))
[('three', 2),
('blind', 2),
('mice', 2),
('!', 2),
('see', 2),
('they', 2),
('a', 2),
('how', 1),
('run', 1),
('all', 1)]
Exercise 19.3 Provide the word frequency list of the top 30 nouns in Alice's Adventures in Wonderland. The raw texts of the novel are available in NLTK (see below).
[('Alice', 392),
('Queen', 71),
('time', 65),
('King', 60),
('Turtle', 58),
('Mock', 56),
('Hatter', 55),
('*', 54),
('Gryphon', 54),
('way', 53),
('head', 50),
('thing', 49),
('voice', 47),
('Rabbit', 44),
('Duchess', 42),
('tone', 40),
('Dormouse', 40),
('March', 34),
('moment', 31),
('Hare', 31),
('nothing', 30),
('things', 30),
('door', 30),
('Mouse', 29),
('eyes', 28),
('Caterpillar', 27),
('day', 25),
('course', 25),
('Cat', 25),
('round', 23)]
19.6 Chinese Word Segmentation
import jieba

# The sample text below is a traffic report on Taiwan's national highways,
# kept in the original Chinese because it is the segmentation input.
text = """
高速公路局說,目前在國道3號北向水上系統至中埔路段車多壅塞,已回堵約3公里。另外,國道1號北向仁德至永康路段路段,已回堵約有7公里。建議駕駛人提前避開壅塞路段改道行駛,行經車多路段請保持行車安全距離,小心行駛。
國道車多壅塞路段還有國1內湖-五堵北向路段、楊梅-新竹南向路段;國3三鶯-關西服務區南向路段、快官-霧峰南向路段、水上系統-中埔北向路段;國6霧峰系統-東草屯東向路段、國10燕巢-燕巢系統東向路段。
"""
text_jb = jieba.lcut(text)
Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/70/qfdgs0k52qj24jtjcz7d0dkm0000gn/T/jieba.cache
Loading model cost 0.265 seconds.
Prefix dict has been built successfully.
| 高速公路 | 局說 | , | 目前 | 在 | 國道 | 3 | 號 | 北向 | 水上 | 系統 | 至 | 中埔 | 路段 | 車多 | 壅塞 | , | 已回 | 堵約 | 3 | 公里 | 。 | 另外 | , | 國道 | 1 | 號 | 北向 | 仁德 | 至 | 永康 | 路段 | 路段 | , | 已回 | 堵 | 約 | 有 | 7 | 公里 | 。 | 建議 | 駕駛人 | 提前 | 避開 | 壅塞 | 路段 | 改道 | 行駛 | , | 行經車 | 多 | 路段 | 請 | 保持 | 行車 | 安全 | 距離 | , | 小心 | 行駛 | 。 |
| 國道 | 車多 | 壅塞 | 路段 | 還有國 | 1 | 內湖 | - | 五堵 | 北向 | 路段 | 、 | 楊梅 | - | 新竹 | 南向 | 路段 | ; | 國 | 3 | 三鶯 | - | 關 | 西服 | 務區 | 南向 | 路段 | 、 | 快官 | - | 霧峰 | 南向 | 路段 | 、 | 水上 | 系統 | - | 中埔 | 北向 | 路段 | ; | 國 | 6 | 霧峰 | 系統 | - | 東 | 草屯 | 東向 | 路段 | 、 | 國 | 10 | 燕巢 | - | 燕巢 | 系統 | 東向 | 路段 | 。 |
19.7 Afterwords
- There are many other aspects to take into account when processing texts. Text segmentation is a non-trivial task in natural language processing, and the base units we work with turn out to be connected to the specific research questions we aim to answer.
- If you would like to know more about computational text analytics, I highly recommend moving on to two more advanced courses:
- ENC2036 Corpus Linguistics
- ENC2045 Computational Linguistics