Lexical Bundles#
This section demonstrates how to identify recurring multiword sequences in texts, a topic that has received considerable attention in language studies in recent years.
Loading libraries#
from nltk.corpus import reuters
from nltk import ngrams
from collections import Counter, defaultdict
import re
Corpus Data#
In this demonstration, we use the reuters corpus as our data source, which is made available in nltk.
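If the corpus data have not been installed yet, they can be downloaded once with nltk's built-in downloader:

import nltk
nltk.download('reuters')  ## one-time download of the reuters corpus data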
## A quick look at the first five sentences
print([' '.join(sent) for sent in reuters.sents()[:5]])
["ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said .", 'They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products .', "But some exporters said that while the conflict would hurt them in the long - run , in the short - term Tokyo ' s loss might be their gain .", "The U . S . Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost .", 'Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes .']
Lexical Bundles#
Lexical bundles are contiguous multiword sequences in texts. Research on lexical bundles normally examines sequences of four to seven words.
The idea of lexical bundles is essentially that of ngrams in NLP, where N refers to the size of the multiword sequence.
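As a quick illustration with a made-up sentence (the toy sentence below is a hypothetical example, not from the corpus), nltk's ngrams() enumerates all contiguous sequences of size n:

from nltk import ngrams

toy_sent = "the results of the study suggest that".split()
print(list(ngrams(toy_sent, n=4)))
## [('the', 'results', 'of', 'the'), ('results', 'of', 'the', 'study'), ...]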
To extract a meaningful set of lexical bundles, we need to consider at least two important distributional criteria (illustrated in the sketch after this list):

- Frequency of the bundle: how often does the sequence occur in the entire corpus?
- Range of the bundle: in how many different texts/documents does the sequence occur in the entire corpus?
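To make the two criteria concrete, here is a minimal sketch with a hypothetical three-document corpus: frequency counts every occurrence of a bundle, whereas range counts each document at most once.

from collections import Counter
from nltk import ngrams

## Hypothetical toy corpus: three one-sentence "documents"
toy_docs = [
    "on the other hand the results".split(),
    "on the other hand we argue".split(),
    "on the other hand on the other hand".split(),
]

freq = Counter()   ## total occurrences across the corpus
rang = Counter()   ## number of documents containing the bundle
for doc in toy_docs:
    seen = set()
    for gram in ngrams(doc, n=4):
        freq[gram] += 1
        seen.add(gram)
    for gram in seen:
        rang[gram] += 1

print(freq[('on', 'the', 'other', 'hand')])  ## 4 (the last document contains it twice)
print(rang[('on', 'the', 'other', 'hand')])  ## 3 (it occurs in all three documents)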
## Number of documents in `reuters`
len(reuters.fileids())
10788
# Create placeholders for the 4-gram bundle statistics
bundles_4 = defaultdict(lambda: defaultdict(lambda: 0))      ## frequency counts
bundles_range = defaultdict(lambda: defaultdict(lambda: 0))  ## range counts
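The nested defaultdict spares us explicit key checks: unseen keys are created on the fly with a default count of 0. A quick demonstration (the keys below are made up):

d = defaultdict(lambda: defaultdict(lambda: 0))
d[('a', 'b', 'c')]['d'] += 1    ## no KeyError: the inner count starts at 0
print(d[('a', 'b', 'c')]['d'])  ## 1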
## All 4-grams of the second sentence
[n for n in ngrams(reuters.sents()[1], n=4)]
[('They', 'told', 'Reuter', 'correspondents'),
('told', 'Reuter', 'correspondents', 'in'),
('Reuter', 'correspondents', 'in', 'Asian'),
('correspondents', 'in', 'Asian', 'capitals'),
('in', 'Asian', 'capitals', 'a'),
('Asian', 'capitals', 'a', 'U'),
('capitals', 'a', 'U', '.'),
('a', 'U', '.', 'S'),
('U', '.', 'S', '.'),
('.', 'S', '.', 'Move'),
('S', '.', 'Move', 'against'),
('.', 'Move', 'against', 'Japan'),
('Move', 'against', 'Japan', 'might'),
('against', 'Japan', 'might', 'boost'),
('Japan', 'might', 'boost', 'protectionist'),
('might', 'boost', 'protectionist', 'sentiment'),
('boost', 'protectionist', 'sentiment', 'in'),
('protectionist', 'sentiment', 'in', 'the'),
('sentiment', 'in', 'the', 'U'),
('in', 'the', 'U', '.'),
('the', 'U', '.', 'S'),
('U', '.', 'S', '.'),
('.', 'S', '.', 'And'),
('S', '.', 'And', 'lead'),
('.', 'And', 'lead', 'to'),
('And', 'lead', 'to', 'curbs'),
('lead', 'to', 'curbs', 'on'),
('to', 'curbs', 'on', 'American'),
('curbs', 'on', 'American', 'imports'),
('on', 'American', 'imports', 'of'),
('American', 'imports', 'of', 'their'),
('imports', 'of', 'their', 'products'),
('of', 'their', 'products', '.')]
%%time
# Count frequency and range of co-occurrence
for fid in reuters.fileids():
    temp = defaultdict(lambda: defaultdict(lambda: 0))
    for sentence in reuters.sents(fileids=fid):
        for w1, w2, w3, w4 in ngrams(sentence, n=4, pad_right=False, pad_left=False):
            ## filter out ngrams containing tokens with no word characters (e.g. punctuation)
            if re.match(r'\w+', w1) and re.match(r'\w+', w2) and re.match(r'\w+', w3) and re.match(r'\w+', w4):
                bundles_4[(w1, w2, w3)][w4] += 1  ## corpus-wide frequency
                temp[(w1, w2, w3)][w4] += 1       ## per-document counts
    # range values: each document contributes at most 1 per bundle
    for key, value in temp.items():
        for k in value.keys():
            bundles_range[key][k] += 1
CPU times: user 19.1 s, sys: 1.15 s, total: 20.2 s
Wall time: 21.8 s
list(bundles_4.items())[:5]
[(('ASIAN', 'EXPORTERS', 'FEAR'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
{'DAMAGE': 1})),
(('EXPORTERS', 'FEAR', 'DAMAGE'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'FROM': 1})),
(('FEAR', 'DAMAGE', 'FROM'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'U': 1})),
(('JAPAN', 'RIFT', 'Mounting'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'trade': 1})),
(('RIFT', 'Mounting', 'trade'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
{'friction': 1}))]
list(bundles_range.items())[:5]
[(('ASIAN', 'EXPORTERS', 'FEAR'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
{'DAMAGE': 1})),
(('EXPORTERS', 'FEAR', 'DAMAGE'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'FROM': 1})),
(('FEAR', 'DAMAGE', 'FROM'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'U': 1})),
(('JAPAN', 'RIFT', 'Mounting'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'trade': 1})),
(('RIFT', 'Mounting', 'trade'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
{'friction': 1}))]
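Both dictionaries can now be queried directly with a trigram key. For instance, the completions recorded after the trigram ('said', 'in', 'a') can be inspected as follows (the full output is long and omitted here):

## All observed fourth words following 'said in a', with their corpus frequencies
print(dict(bundles_4[('said', 'in', 'a')]))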
Convert to data frames#
For a more intuitive reading of the bundle data, we can create a data frame with the distributional information of each bundle type.
Most importantly, the data frame makes it easy to filter and sort the bundle data.
Create four lists:

- w1_w2_w3: the first three words in the bundle
- w4: the last word in the bundle
- freq: the frequency of the bundle
- rangev: the range of the bundle
%%time
import pandas as pd

w1_w2_w3 = []
w4 = []
freq = []
rangev = []

for _w123 in bundles_4.keys():
    for _w4 in bundles_4[_w123].keys():
        w1_w2_w3.append('_'.join(_w123))
        w4.append(_w4)
        freq.append(bundles_4[_w123][_w4])
        rangev.append(bundles_range[_w123][_w4])
CPU times: user 1.19 s, sys: 81.7 ms, total: 1.27 s
Wall time: 3.05 s
Check the lengths of the four lists before combining them into a data frame.
print(len(w1_w2_w3))
print(len(w4))
print(len(freq))
print(len(rangev))
691190
691190
691190
691190
Create the bundle data frame.
bundles_df = pd.DataFrame(list(zip(w1_w2_w3, w4, freq, rangev)),
                          columns=['w123', 'w4', 'freq', 'range'])
bundles_df.head()
|   | w123 | w4 | freq | range |
|---|------|----|------|-------|
| 0 | ASIAN_EXPORTERS_FEAR | DAMAGE | 1 | 1 |
| 1 | EXPORTERS_FEAR_DAMAGE | FROM | 1 | 1 |
| 2 | FEAR_DAMAGE_FROM | U | 1 | 1 |
| 3 | JAPAN_RIFT_Mounting | trade | 1 | 1 |
| 4 | RIFT_Mounting_trade | friction | 1 | 1 |
Filter bundles whose range is >= 10 and sort the data frame by the bundles' range values.
bundles_df[(bundles_df['range']>=10)].sort_values(['range'], ascending=[False]).head(20)
|   | w123 | w4 | freq | range |
|---|------|----|------|-------|
| 5717 | Securities_and_Exchange | Commission | 275 | 271 |
| 4813 | said_in_a | statement | 264 | 260 |
| 5714 | the_Securities_and | Exchange | 258 | 254 |
| 47163 | 3RD_QTR_NET | Shr | 233 | 233 |
| 7112 | The_company_said | the | 230 | 211 |
| 46330 | mln_Nine_mths | Shr | 203 | 203 |
| 7103 | The_company_said | it | 213 | 197 |
| 6357 | at_the_end | of | 250 | 178 |
| 51576 | 4TH_QTR_NET | Shr | 178 | 178 |
| 60176 | with_the_Securities | and | 162 | 162 |
| 25083 | cts_prior_Pay | April | 161 | 157 |
| 11887 | pct_of_the | total | 162 | 156 |
| 40339 | QTR_LOSS_Shr | loss | 142 | 142 |
| 24004 | Inc_said_it | has | 141 | 141 |
| 26751 | 1ST_QTR_NET | Shr | 137 | 137 |
| 49905 | QTR_JAN_31 | NET | 133 | 133 |
| 60168 | a_filing_with | the | 130 | 130 |
| 21944 | said_it_expects | to | 136 | 130 |
| 33141 | JAN_31_NET | Shr | 129 | 129 |
| 9673 | The_Bank_of | England | 129 | 126 |
Identify bundles whose w4 is either in or to.
bundles_df[(bundles_df['range']>=10) & (bundles_df['w4'].isin(['in','to']))].sort_values(['range'], ascending=[False]).head(20)
|   | w123 | w4 | freq | range |
|---|------|----|------|-------|
| 21944 | said_it_expects | to | 136 | 130 |
| 33219 | said_it_agreed | to | 113 | 111 |
| 42141 | said_it_plans | to | 84 | 82 |
| 88616 | agreed_in_principle | to | 75 | 75 |
| 45606 | letter_of_intent | to | 72 | 71 |
| 85882 | it_has_agreed | to | 48 | 48 |
| 60690 | a_definitive_agreement | to | 48 | 48 |
| 62568 | cts_a_share | in | 65 | 47 |
| 37697 | dlrs_a_share | in | 54 | 45 |
| 769 | who_asked_not | to | 41 | 40 |
| 1603 | in_an_effort | to | 38 | 38 |
| 85883 | it_has_agreed | in | 35 | 35 |
| 25651 | will_be_used | to | 34 | 34 |
| 42760 | 5_mln_dlrs | in | 33 | 33 |
| 2215 | in_the_year | to | 42 | 33 |
| 29471 | transaction_is_subject | to | 34 | 32 |
| 65576 | will_be_able | to | 33 | 31 |
| 60148 | raised_its_stake | in | 31 | 31 |
| 33220 | said_it_agreed | in | 32 | 31 |
| 54249 | dlrs_per_share | in | 32 | 29 |
Restructure dictionary#
# ## filter and sort
# ## remove ngrams with non-word characters
# bundles_4_2 = {(w1, w2, w3): value for (w1, w2, w3), value in bundles_4.items() if
#                re.match(r'\w+', w1) and re.match(r'\w+', w2) and re.match(r'\w+', w3)}
# print(len(bundles_4))
# print(len(bundles_4_2))

# ## remove ngrams whose freq < 5 and whose w4 has non-word characters
# bundles_4_3 = {}
# for w1_w2_w3 in bundles_4_2:
#     bundles_4_3[w1_w2_w3] = {w4: v for w4, v in bundles_4[w1_w2_w3].items() if v >= 5 and re.match(r'\w+', w4)}

# ## clean up dictionary: drop trigrams left with no qualifying w4
# bundles_4_3 = {key: value for key, value in bundles_4_3.items() if len(value) != 0}
# print(list(bundles_4_3.items())[:5])
# print(len(bundles_4_3))

# ## from raw frequencies to forward transitional probabilities
# for w1_w2_w3 in bundles_4:
#     total_count = float(sum(bundles_4[w1_w2_w3].values()))
#     for w4 in bundles_4[w1_w2_w3]:
#         bundles_4[w1_w2_w3][w4] /= total_count

# ## flatten the dictionary into {4-gram tuple: value}
# bundles_4_4 = {}
# for w1_w2_w3 in bundles_4_3:
#     for w4 in bundles_4_3[w1_w2_w3]:
#         ngram = list(w1_w2_w3) + [w4]
#         bundles_4_4[tuple(ngram)] = bundles_4_3[w1_w2_w3][w4]

# sorted(bundles_4_4.items(), key=lambda x: x[1], reverse=True)
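The commented-out block above also sketches how the raw counts could be turned into forward transitional probabilities, i.e. P(w4 | w1 w2 w3) = freq(w1 w2 w3 w4) / freq(w1 w2 w3). Below is a minimal standalone version of the same idea that works on a copy, so bundles_4 keeps its raw frequencies (the name bundles_4_prob is our own addition, not from the code above):

## From raw completion counts to forward transitional probabilities (on a copy)
bundles_4_prob = {}
for w123, completions in bundles_4.items():
    total = float(sum(completions.values()))
    bundles_4_prob[w123] = {w4: count / total for w4, count in completions.items()}

## e.g. the probability distribution of continuations after ('the', 'Securities', 'and')
print(bundles_4_prob[('the', 'Securities', 'and')])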