Chapter 19 NLTK

  • The almighty nltk package!

19.1 Installation

  • Install the package (the leading ! runs the shell command from within a Jupyter notebook; drop it in a plain terminal)
!pip install nltk
  • Download nltk data in Python
import nltk
nltk.download('all', halt_on_error=False)
# Alternatively, open the interactive downloader:
# nltk.download()

The complete collection of nltk.corpus data is huge, and you probably do not need all of it. Calling nltk.download() without arguments opens an interactive window for installing specific datasets.
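If you already know which resources you need, you can also download them selectively by name. Below is a minimal sketch; the identifiers are the standard NLTK names for the datasets used in this chapter.

import nltk

## Download only the resources used in this chapter
for resource in ['brown', 'reuters', 'gutenberg', 'wordnet',
                 'stopwords', 'punkt', 'averaged_perceptron_tagger']:
    nltk.download(resource)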

19.2 Corpora Data

  • The package includes a lot of pre-loaded corpora datasets
  • The default nltk_data directory is /Users/YOUR_NAME/nltk_data/ (on macOS; the location varies by operating system)
  • Selected examples:
    • Brown Corpus
    • Reuters Corpus
    • WordNet
from nltk.corpus import gutenberg, brown, reuters
# brown corpus
## Categories (i.e., text genres)
print('Brown Corpus Total Categories: ', len(brown.categories()))
Brown Corpus Total Categories:  15
print('Categories List: ', brown.categories())
Categories List:  ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
# Sentences
print(brown.sents()[0]) ## first sentence
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
print(brown.sents(categories='fiction')) ## sentences from fiction texts
[['Thirty-three'], ['Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.'], ...]
## Tagged Sentences
print(brown.tagged_sents()[0])
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]
## Sentences in natural (plain-text) form
sents = brown.sents(categories='fiction')
[' '.join(sent) for sent in sents[1:5]]
['Scotty did not go back to school .', 'His parents talked seriously and lengthily to their own doctor and to a specialist at the University Hospital -- Mr. McKinley was entitled to a discount for members of his family -- and it was decided it would be best for him to take the remainder of the term off , spend a lot of time in bed and , for the rest , do pretty much as he chose -- provided , of course , he chose to do nothing too exciting or too debilitating .', 'His teacher and his school principal were conferred with and everyone agreed that , if he kept up with a certain amount of work at home , there was little danger of his losing a term .', 'Scotty accepted the decision with indifference and did not enter the arguments .']
## Get tagged words
tagged_words = brown.tagged_words(categories='fiction')

# print(tagged_words[1]) ## a (word, tag) tuple

## Get all nouns (Brown noun tags contain 'NP' or 'NN')
nouns = [(word, tag) for word, tag in tagged_words
                      if any(noun_tag in tag for noun_tag in ['NP', 'NN'])]
## Check first ten nouns
nouns[:10]
[('Scotty', 'NP'), ('school', 'NN'), ('parents', 'NNS'), ('doctor', 'NN'), ('specialist', 'NN'), ('University', 'NN-TL'), ('Hospital', 'NN-TL'), ('Mr.', 'NP'), ('McKinley', 'NP'), ('discount', 'NN')]
## Create a frequency list of the nouns
nouns_freq = nltk.FreqDist([w for w, t in nouns])
sorted(nouns_freq.items(), key=lambda x: x[1], reverse=True)[:20] ## sort by frequency, descending
[('man', 111), ('time', 99), ('men', 72), ('room', 63), ('way', 62), ('eyes', 60), ('face', 55), ('house', 54), ('head', 54), ('night', 53), ('day', 52), ('hand', 50), ('door', 47), ('life', 44), ('years', 44), ('Mrs.', 41), ('God', 41), ('Kate', 40), ('Mr.', 39), ('people', 39)]
sorted(nouns_freq.items(), key=lambda x: x[0], reverse=True)[:20] ## sort reverse-alphabetically by word
[('zoo', 2), ('zlotys', 1), ('zenith', 1), ('youth', 5), ('yelling', 1), ('years', 44), ('yearning', 1), ("year's", 1), ('year', 9), ('yards', 4), ('yard', 7), ('yachts', 1), ('writing', 2), ('writers', 1), ('writer', 4), ('wrists', 1), ('wrist', 2), ('wrinkles', 1), ('wrinkle', 1), ('wretch', 1)]
nouns_freq.most_common(10)
[('man', 111), ('time', 99), ('men', 72), ('room', 63), ('way', 62), ('eyes', 60), ('face', 55), ('house', 54), ('head', 54), ('night', 53)]
## Access data via fileid
brown.fileids(categories='fiction')[0]
'ck01'
brown.sents(fileids='ck01')
[['Thirty-three'], ['Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.'], ...]
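The Reuters corpus imported earlier can be explored with the same interface. A quick sketch (outputs omitted; Reuters documents may carry multiple category labels, and 'trade' is one of the standard Reuters-21578 topics):

## Reuters corpus: categorized newswire documents
print('Reuters Total Categories: ', len(reuters.categories()))
print(reuters.categories()[:10])                ## first ten topic labels
print(reuters.fileids(categories='trade')[:5])  ## documents labeled 'trade'
print(reuters.words(categories='trade')[:20])   ## first 20 tokens of that category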

19.3 WordNet

19.3.1 A Dictionary Resource

from nltk.corpus import wordnet as wn
word = 'walk'

# get synsets
word_synsets = wn.synsets(word, pos='v')
word_synsets
[Synset('walk.v.01'), Synset('walk.v.02'), Synset('walk.v.03'), Synset('walk.v.04'), Synset('walk.v.05'), Synset('walk.v.06'), Synset('walk.v.07'), Synset('walk.v.08'), Synset('walk.v.09'), Synset('walk.v.10')]

Word sense is closely connected to its parts-of-speech. Therefore, in WordNet it is crucial to specify the POS tag of the word to obtain the correct synset of the word. There are four common part-of-speech tags in WordNet, as shown below:

Part of Speech   Tag
--------------   ---
Noun             n
Adjective        a
Adverb           r
Verb             v
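For example, querying the same word form with different POS tags returns entirely different synsets. A minimal sketch:

print(wn.synsets('walk', pos='n'))  ## noun senses of 'walk'
print(wn.synsets('walk', pos='v'))  ## verb senses of 'walk'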
  • Check the definition of a synset (i.e., a specific sense of the word):
word_synsets[0].definition()
"use one's feet to advance; advance by steps"
  • Check the examples of a synset:
word_synsets[0].examples()
["Walk, don't run!", 'We walked instead of driving', 'She walks with a slight limp', 'The patient cannot walk yet', 'Walk over to the cabinet']
## Get details of each synset
for s in word_synsets:
    if str(s.name()).startswith('walk.v'):
        print(
            'Synset ID: %s \n'
            'POS Tag: %s \n'
            'Definition: %s \n'
            'Examples: %s \n' % (s.name(), s.pos(), s.definition(), s.examples())
        )
Synset ID: walk.v.01 
POS Tag: v 
Definition: use one's feet to advance; advance by steps 
Examples: ["Walk, don't run!", 'We walked instead of driving', 'She walks with a slight limp', 'The patient cannot walk yet', 'Walk over to the cabinet'] 

Synset ID: walk.v.02 
POS Tag: v 
Definition: accompany or escort 
Examples: ["I'll walk you to your car"] 

Synset ID: walk.v.03 
POS Tag: v 
Definition: obtain a base on balls 
Examples: [] 

Synset ID: walk.v.04 
POS Tag: v 
Definition: traverse or cover by walking 
Examples: ['Walk the tightrope', 'Paul walked the streets of Damascus', 'She walks 3 miles every day'] 

Synset ID: walk.v.05 
POS Tag: v 
Definition: give a base on balls to 
Examples: [] 

Synset ID: walk.v.06 
POS Tag: v 
Definition: live or behave in a specified manner 
Examples: ['walk in sadness'] 

Synset ID: walk.v.07 
POS Tag: v 
Definition: be or act in association with 
Examples: ['We must walk with our dispossessed brothers and sisters', 'Walk with God'] 

Synset ID: walk.v.08 
POS Tag: v 
Definition: walk at a pace 
Examples: ['The horses walked across the meadow'] 

Synset ID: walk.v.09 
POS Tag: v 
Definition: make walk 
Examples: ['He walks the horse up the mountain', 'Walk the dog twice a day'] 

Synset ID: walk.v.10 
POS Tag: v 
Definition: take a walk; go for a walk; walk for pleasure 
Examples: ['The lovers held hands while walking', 'We like to walk every Sunday'] 

19.3.2 Lexical Relations

  • Extract words that maintain some lexical relations with the synset:
word_synsets[0].hypernyms() # hypernym
[Synset('travel.v.01')]
word_synsets[0].hypernyms()[0].hyponyms() # co-hyponyms (sister terms) of walk.v.01
[Synset('accompany.v.02'), Synset('advance.v.01'), Synset('angle.v.01'), Synset('ascend.v.01'), Synset('automobile.v.01'), Synset('back.v.02'), Synset('bang.v.04'), Synset('beetle.v.02'), Synset('betake_oneself.v.01'), Synset('billow.v.02'), Synset('bounce.v.03'), Synset('breeze.v.02'), Synset('caravan.v.01'), Synset('career.v.01'), Synset('carry.v.36'), Synset('circle.v.01'), Synset('circle.v.02'), Synset('circuit.v.01'), Synset('circulate.v.07'), Synset('come.v.01'), Synset('come.v.11'), Synset('crawl.v.01'), Synset('cruise.v.02'), Synset('derail.v.02'), Synset('descend.v.01'), Synset('do.v.13'), Synset('drag.v.04'), Synset('draw.v.12'), Synset('drive.v.02'), Synset('drive.v.14'), Synset('ease.v.01'), Synset('fall.v.01'), Synset('fall.v.15'), Synset('ferry.v.03'), Synset('float.v.01'), Synset('float.v.02'), Synset('float.v.05'), Synset('flock.v.01'), Synset('fly.v.01'), Synset('fly.v.06'), Synset('follow.v.01'), Synset('follow.v.04'), Synset('forge.v.05'), Synset('get_around.v.04'), Synset('ghost.v.01'), Synset('glide.v.01'), Synset('go_around.v.02'), Synset('hiss.v.02'), Synset('hurtle.v.01'), Synset('island_hop.v.01'), Synset('lance.v.01'), Synset('lurch.v.03'), Synset('outflank.v.01'), Synset('pace.v.02'), Synset('pan.v.01'), Synset('pass.v.01'), Synset('pass_over.v.04'), Synset('play.v.09'), Synset('plow.v.03'), Synset('prance.v.02'), Synset('precede.v.04'), Synset('precess.v.01'), Synset('proceed.v.02'), Synset('propagate.v.02'), Synset('pursue.v.02'), Synset('push.v.09'), Synset('raft.v.02'), Synset('repair.v.03'), Synset('retreat.v.02'), Synset('retrograde.v.02'), Synset('return.v.01'), Synset('ride.v.01'), Synset('ride.v.04'), Synset('ride.v.10'), Synset('rise.v.01'), Synset('roll.v.12'), Synset('round.v.01'), Synset('run.v.11'), Synset('run.v.34'), Synset('rush.v.01'), Synset('scramble.v.01'), Synset('seek.v.04'), Synset('shuttle.v.01'), Synset('sift.v.01'), Synset('ski.v.01'), Synset('slice_into.v.01'), Synset('slither.v.01'), Synset('snowshoe.v.01'), Synset('speed.v.04'), Synset('steamer.v.01'), Synset('step.v.01'), Synset('step.v.02'), Synset('step.v.06'), Synset('stray.v.02'), Synset('swap.v.02'), Synset('swash.v.01'), Synset('swim.v.01'), Synset('swim.v.05'), Synset('swing.v.03'), Synset('taxi.v.01'), Synset('trail.v.03'), Synset('tram.v.01'), Synset('transfer.v.06'), Synset('travel.v.04'), Synset('travel.v.05'), Synset('travel.v.06'), Synset('travel_by.v.01'), Synset('travel_purposefully.v.01'), Synset('travel_rapidly.v.01'), Synset('trundle.v.01'), Synset('turn.v.06'), Synset('walk.v.01'), Synset('walk.v.10'), Synset('weave.v.04'), Synset('wend.v.01'), Synset('wheel.v.03'), Synset('whine.v.01'), Synset('whish.v.02'), Synset('whisk.v.02'), Synset('whistle.v.02'), Synset('withdraw.v.01'), Synset('zigzag.v.01'), Synset('zoom.v.02')]
word_synsets[0].root_hypernyms() # root 
[Synset('travel.v.01')]
word_synsets[0].hypernym_paths() # from root to this synset in WordNet
[[Synset('travel.v.01'), Synset('walk.v.01')]]
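Noun synsets carry further lexical relations, such as meronymy (part-whole) and holonymy (whole-part). A brief sketch using the standard WordNet entry tree.n.01:

tree = wn.synset('tree.n.01')
print(tree.part_meronyms())       ## parts of a tree (e.g., trunk, limb)
print(tree.substance_meronyms())  ## what a tree consists of (e.g., sapwood)
print(tree.member_holonyms())     ## wholes a tree belongs to (e.g., forest)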
  • Synonyms
# Collect synonyms of all synsets
synonyms = []
for syn in wn.synsets('book', pos='v'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
len(synonyms)
6
len(set(synonyms))
3
print(set(synonyms))
{'book', 'reserve', 'hold'}
  • Antonyms
# First Synset Lemma String
word_synsets[0].lemmas()[0].name() 
'walk'
# Antonyms of the First Synset
word_synsets[0].lemmas()[0].antonyms()[0]
Lemma('ride.v.02.ride')

While the taxonomic relations above (e.g., hypernymy and hyponymy) hold between synsets, the synonymy and antonymy relations hold between lemmas. We therefore need to be very clear about the three levels of representation in WordNet:

  • Word Form
  • Lemma
  • Synset
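A minimal sketch of the three levels for the verb walk (all attribute and method names are standard nltk.corpus.wordnet API):

form = 'walk'                          ## word form: a plain string
synset = wn.synsets(form, pos='v')[0]  ## synset: one sense, possibly shared by several lemmas
lemma = synset.lemmas()[0]             ## lemma: a word form paired with a specific synset

print(synset.name())                        ## 'walk.v.01'
print(lemma.name())                         ## 'walk'
print([l.name() for l in synset.lemmas()])  ## all lemmas of this synset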

19.3.3 Semantic Similarity Computation

Synsets are organized in a hypernym tree, which can be used for reasoning about the semantic similarity of two Synsets. The closer the two Synsets are in the tree, the more similar they are.

# syn1 = wn.synsets('walk', pos='v')[0]
syn1 = wn.synset('walk.v.01')
syn2 = wn.synset('toddle.v.01')
syn3 = wn.synset('think.v.01')

syn1.wup_similarity(syn2)
0.8
syn1.wup_similarity(syn3)
0.2857142857142857
syn1.path_similarity(syn2)
0.5
syn1.path_similarity(syn3)
0.16666666666666666
ref = syn1.hypernyms()[0]
syn1.shortest_path_distance(ref)
1
syn2.shortest_path_distance(ref)
2
syn1.shortest_path_distance(syn2)
1
print(ref.definition())
change location; move, travel, or proceed, also metaphorically
print([l.name() for l in ref.lemmas()])
['travel', 'go', 'move', 'locomote']
syn1.hypernym_paths()
[[Synset('travel.v.01'), Synset('walk.v.01')]]
syn2.hypernym_paths()
[[Synset('travel.v.01'), Synset('walk.v.01'), Synset('toddle.v.01')]]
syn3.hypernym_paths()
[[Synset('think.v.03'), Synset('evaluate.v.02'), Synset('think.v.01')]]

The wup_similarity method is short for Wu-Palmer Similarity. It scores how similar two word senses are based on where the synsets sit relative to each other in the hypernym tree. For more information and other scoring methods, please check the NLTK WordNet documentation.
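One such alternative is Leacock-Chodorow similarity, which scales the shortest path length by the depth of the taxonomy (both synsets must have the same POS). A quick sketch reusing the synsets above:

## Leacock-Chodorow: -log(path_length / (2 * taxonomy_depth))
print(syn1.lch_similarity(syn2))  ## walk vs. toddle: higher score
print(syn1.lch_similarity(syn3))  ## walk vs. think: lower score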

19.4 Discovering Word Collocations

from nltk.corpus import brown
from nltk.collocations import BigramCollocationFinder

from nltk.metrics import BigramAssocMeasures

words = [w.lower() for w in brown.words()]
bcf = BigramCollocationFinder.from_words(words)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
[(';', ';'), ('?', '?'), ('of', 'the'), ('.', '``')]
# deal with stopwords
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
## Filter criteria:
## remove words shorter than 3 characters or on the stopword list
filter_stops = lambda w: len(w) < 3 or w in stopset
bcf.apply_word_filter(filter_stops)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 10)
[('united', 'states'), ('new', 'york'), ('per', 'cent'), ('years', 'ago'), ('rhode', 'island'), ('los', 'angeles'), ('peace', 'corps'), ('san', 'francisco'), ('high', 'school'), ('fiscal', 'year')]
## apply freq-based filter
bcf.apply_freq_filter(3)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 10)
[('united', 'states'), ('new', 'york'), ('per', 'cent'), ('years', 'ago'), ('rhode', 'island'), ('los', 'angeles'), ('peace', 'corps'), ('san', 'francisco'), ('high', 'school'), ('fiscal', 'year')]
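The likelihood ratio is only one of several association measures in BigramAssocMeasures. A one-line sketch swapping in pointwise mutual information (PMI); PMI strongly favors rare pairs, which is why the frequency filter applied above is especially important here:

bcf.nbest(BigramAssocMeasures.pmi, 10)  ## rank the filtered bigrams by PMI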

Exercise 19.1 Try to identify bigram collocations in the corpus Alice's Adventures in Wonderland. The texts are available in nltk.corpus.gutenberg.

Exercise 19.2 Following the same strategy of bigram collocation extraction, please try to extract trigrams from the Brown corpus. Remove stop words and short words as we did in the lecture. Include only trigrams whose frequency > 5.

19.5 Tokenization

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = """Three blind mice! See how they run!
They all ran after the farmer's wife,
Who cut off their tails with a carving knife.
Did you ever see such a thing in your life
As three blind mice?
"""

text_sent = sent_tokenize(text)
text_word = word_tokenize(text)
text_pos = nltk.pos_tag(text_word)
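The three objects hold sentences, word tokens, and POS-tagged tokens, respectively. A quick peek (outputs omitted):

print(text_sent[0])    ## first sentence
print(text_word[:5])   ## first five word tokens
print(text_pos[:5])    ## first five (token, POS-tag) pairs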
  • With words, we can create a frequency list:
import pprint as pp
text_fd= nltk.FreqDist([w.lower() for w in text_word])
pp.pprint(text_fd.most_common(10))
[('three', 2),
 ('blind', 2),
 ('mice', 2),
 ('!', 2),
 ('see', 2),
 ('they', 2),
 ('a', 2),
 ('how', 1),
 ('run', 1),
 ('all', 1)]
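FreqDist offers a few more convenience methods; a small sketch (the plot requires matplotlib to be installed):

print(text_fd['mice'])  ## count of a single token
print(text_fd.N())      ## total number of tokens
text_fd.plot(10)        ## frequency plot of the top 10 tokens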

Exercise 19.3 Provide the word frequency list of the top 30 nouns in Alice's Adventures in Wonderland. The raw text of the novel is available in NLTK (see below).

alice = nltk.corpus.gutenberg.raw(fileids='carroll-alice.txt')
[('Alice', 392),
 ('Queen', 71),
 ('time', 65),
 ('King', 60),
 ('Turtle', 58),
 ('Mock', 56),
 ('Hatter', 55),
 ('*', 54),
 ('Gryphon', 54),
 ('way', 53),
 ('head', 50),
 ('thing', 49),
 ('voice', 47),
 ('Rabbit', 44),
 ('Duchess', 42),
 ('tone', 40),
 ('Dormouse', 40),
 ('March', 34),
 ('moment', 31),
 ('Hare', 31),
 ('nothing', 30),
 ('things', 30),
 ('door', 30),
 ('Mouse', 29),
 ('eyes', 28),
 ('Caterpillar', 27),
 ('day', 25),
 ('course', 25),
 ('Cat', 25),
 ('round', 23)]

19.6 Chinese Word Segmentation

import jieba
text = """
高速公路局說,目前在國道3號北向水上系統至中埔路段車多壅塞,已回堵約3公里。另外,國道1號北向仁德至永康路段路段,已回堵約有7公里。建議駕駛人提前避開壅塞路段改道行駛,行經車多路段請保持行車安全距離,小心行駛。
國道車多壅塞路段還有國1內湖-五堵北向路段、楊梅-新竹南向路段;國3三鶯-關西服務區南向路段、快官-霧峰南向路段、水上系統-中埔北向路段;國6霧峰系統-東草屯東向路段、國10燕巢-燕巢系統東向路段。
"""
text_jb = jieba.lcut(text)
Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/70/qfdgs0k52qj24jtjcz7d0dkm0000gn/T/jieba.cache
Loading model cost 0.265 seconds.
Prefix dict has been built successfully.
print(' | '.join(text_jb))

 | 高速公路 | 局說 | , | 目前 | 在 | 國道 | 3 | 號 | 北向 | 水上 | 系統 | 至 | 中埔 | 路段 | 車多 | 壅塞 | , | 已回 | 堵約 | 3 | 公里 | 。 | 另外 | , | 國道 | 1 | 號 | 北向 | 仁德 | 至 | 永康 | 路段 | 路段 | , | 已回 | 堵 | 約 | 有 | 7 | 公里 | 。 | 建議 | 駕駛人 | 提前 | 避開 | 壅塞 | 路段 | 改道 | 行駛 | , | 行經車 | 多 | 路段 | 請 | 保持 | 行車 | 安全 | 距離 | , | 小心 | 行駛 | 。 | 
 | 國道 | 車多 | 壅塞 | 路段 | 還有國 | 1 | 內湖 | - | 五堵 | 北向 | 路段 | 、 | 楊梅 | - | 新竹 | 南向 | 路段 | ; | 國 | 3 | 三鶯 | - | 關 | 西服 | 務區 | 南向 | 路段 | 、 | 快官 | - | 霧峰 | 南向 | 路段 | 、 | 水上 | 系統 | - | 中埔 | 北向 | 路段 | ; | 國 | 6 | 霧峰 | 系統 | - | 東 | 草屯 | 東向 | 路段 | 、 | 國 | 10 | 燕巢 | - | 燕巢 | 系統 | 東向 | 路段 | 。 | 
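Out-of-vocabulary terms are a common source of segmentation errors: in the output above, 水上系統 is split into 水上 | 系統 and 關西服務區 into 關 | 西服 | 務區. jieba lets you add entries to its dictionary at runtime. A minimal sketch, assuming these are the phrases you want kept whole:

## Add domain terms so jieba keeps them as single tokens
for term in ['水上系統', '關西服務區', '高速公路局']:
    jieba.add_word(term)

print(' | '.join(jieba.lcut(text)))  ## re-segment with the updated dictionary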

19.7 Afterwords

  • There are many other aspects to take into account when processing texts. Text segmentation is a non-trivial task in natural language processing, and the base units we work with turn out to be tied to the specific research questions we aim to answer.
  • If you would like to know more about computational text analytics, I highly recommend moving on to two more advanced courses:
    • ENC2036 Corpus Linguistics
    • ENC2045 Computational Linguistics