Lexical Bundles#

This section demonstrates how to identify recurring multiword sequences in texts, a topic that has received considerable attention in language studies in recent years.

Loading libraries#

from nltk.corpus import reuters
from nltk import ngrams
from collections import Counter, defaultdict
import re

Corpus Data#

In this demonstration, we use the Reuters corpus, which comes bundled with nltk, as our data source.
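If the corpus is not yet present in the local nltk data directory, it can be fetched once with the nltk downloader before running the rest of this section:

## One-time download of the Reuters corpus data
import nltk
nltk.download('reuters')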

## A quick look at the first five sentences
print([' '.join(sent) for sent in reuters.sents()[:5]])
["ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said .", 'They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products .', "But some exporters said that while the conflict would hurt them in the long - run , in the short - term Tokyo ' s loss might be their gain .", "The U . S . Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost .", 'Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes .']

Lexical Bundles#

Lexical bundles are contiguous multiword sequences extracted from texts. Research on lexical bundles normally examines sequences of four to seven words.

The idea of lexical bundles corresponds closely to that of ngrams in NLP, where N refers to the size of the multiword sequence.
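As a minimal illustration (with a made-up sentence rather than corpus data), nltk's ngrams() yields every contiguous four-word sequence:

## A toy example: all 4-grams of a made-up sentence
list(ngrams('the more you read the more you know'.split(), n=4))
[('the', 'more', 'you', 'read'),
 ('more', 'you', 'read', 'the'),
 ('you', 'read', 'the', 'more'),
 ('read', 'the', 'more', 'you'),
 ('the', 'more', 'you', 'know')]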

To extract a meaningful set of lexical bundles, we need to consider at least two important distributional criteria (illustrated with a toy sketch after this list):

  • Frequency of the bundle: how often does the sequence occur in the entire corpus?

  • Range of the bundle: in how many different texts/documents does the sequence occur in the entire corpus?
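The two criteria can diverge: a bundle may be frequent overall yet confined to a single text. The following sketch, using two made-up mini-documents (toy data, not reuters), computes both statistics with Counter:

## Toy data: two made-up mini-documents
toy_docs = [['as', 'a', 'result', 'of', 'this', 'as', 'a', 'result', 'of', 'that'],
            ['as', 'a', 'result', 'of', 'the', 'vote']]
toy_freq = Counter()    # total occurrences across all documents
toy_range = Counter()   # number of documents containing the bundle
for doc in toy_docs:
    grams = list(ngrams(doc, n=4))
    toy_freq.update(grams)         # every occurrence counts
    toy_range.update(set(grams))   # each document counts at most once
print(toy_freq[('as', 'a', 'result', 'of')])   # frequency: 3
print(toy_range[('as', 'a', 'result', 'of')])  # range: 2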

## Number of documents in `reuters`
len(reuters.fileids())
10788
# Create placeholders for the 4-gram bundle statistics:
# bundles_4 maps each three-word stem to the frequency of each fourth word;
# bundles_range maps the same keys to the number of documents (range)
bundles_4 = defaultdict(lambda: defaultdict(lambda: 0))
bundles_range = defaultdict(lambda: defaultdict(lambda: 0))
## All 4-grams of the second sentence
list(ngrams(reuters.sents()[1], n=4))
[('They', 'told', 'Reuter', 'correspondents'),
 ('told', 'Reuter', 'correspondents', 'in'),
 ('Reuter', 'correspondents', 'in', 'Asian'),
 ('correspondents', 'in', 'Asian', 'capitals'),
 ('in', 'Asian', 'capitals', 'a'),
 ('Asian', 'capitals', 'a', 'U'),
 ('capitals', 'a', 'U', '.'),
 ('a', 'U', '.', 'S'),
 ('U', '.', 'S', '.'),
 ('.', 'S', '.', 'Move'),
 ('S', '.', 'Move', 'against'),
 ('.', 'Move', 'against', 'Japan'),
 ('Move', 'against', 'Japan', 'might'),
 ('against', 'Japan', 'might', 'boost'),
 ('Japan', 'might', 'boost', 'protectionist'),
 ('might', 'boost', 'protectionist', 'sentiment'),
 ('boost', 'protectionist', 'sentiment', 'in'),
 ('protectionist', 'sentiment', 'in', 'the'),
 ('sentiment', 'in', 'the', 'U'),
 ('in', 'the', 'U', '.'),
 ('the', 'U', '.', 'S'),
 ('U', '.', 'S', '.'),
 ('.', 'S', '.', 'And'),
 ('S', '.', 'And', 'lead'),
 ('.', 'And', 'lead', 'to'),
 ('And', 'lead', 'to', 'curbs'),
 ('lead', 'to', 'curbs', 'on'),
 ('to', 'curbs', 'on', 'American'),
 ('curbs', 'on', 'American', 'imports'),
 ('on', 'American', 'imports', 'of'),
 ('American', 'imports', 'of', 'their'),
 ('imports', 'of', 'their', 'products'),
 ('of', 'their', 'products', '.')]
%%time
# Count corpus-wide frequency and per-document range of each 4-gram
for fid in reuters.fileids():
    temp = defaultdict(lambda: defaultdict(lambda: 0))  # per-document counts
    for sentence in reuters.sents(fileids=fid):
        for w1, w2, w3, w4 in ngrams(sentence, n=4, pad_right=False, pad_left=False):
            ## keep only 4-grams in which every word starts with a word character
            if re.match(r'\w+', w1) and re.match(r'\w+', w2) and re.match(r'\w+', w3) and re.match(r'\w+', w4):
                bundles_4[(w1, w2, w3)][w4] += 1
                temp[(w1, w2, w3)][w4] += 1
    # range: count each bundle at most once per document
    for key, value in temp.items():
        for k in value.keys():
            bundles_range[key][k] += 1
CPU times: user 19.1 s, sys: 1.15 s, total: 20.2 s
Wall time: 21.8 s
list(bundles_4.items())[:5]
[(('ASIAN', 'EXPORTERS', 'FEAR'),
  defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
              {'DAMAGE': 1})),
 (('EXPORTERS', 'FEAR', 'DAMAGE'),
  defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'FROM': 1})),
 (('FEAR', 'DAMAGE', 'FROM'),
  defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'U': 1})),
 (('JAPAN', 'RIFT', 'Mounting'),
  defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'trade': 1})),
 (('RIFT', 'Mounting', 'trade'),
  defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
              {'friction': 1}))]
list(bundles_range.items())[:5]
[(('ASIAN', 'EXPORTERS', 'FEAR'),
  defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
              {'DAMAGE': 1})),
 (('EXPORTERS', 'FEAR', 'DAMAGE'),
  defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'FROM': 1})),
 (('FEAR', 'DAMAGE', 'FROM'),
  defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'U': 1})),
 (('JAPAN', 'RIFT', 'Mounting'),
  defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'trade': 1})),
 (('RIFT', 'Mounting', 'trade'),
  defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
              {'friction': 1}))]
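With both dictionaries filled, the continuations recorded for any three-word stem can be queried directly. A small sketch (the stem is picked for illustration; note that looking up a missing key in a defaultdict silently inserts it):

## The most frequent continuations of a given three-word stem
stem = ('said', 'in', 'a')
sorted(bundles_4[stem].items(), key=lambda x: x[1], reverse=True)[:5]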

Convert to data frames#

  • For a more intuitive view of the bundle data, we can create a data frame containing the distributional information of each bundle type.

  • Most importantly, the data frame makes it easy to filter and sort the bundle data.

Create four lists:

  • w1_w2_w3: the first three words in the bundle

  • w4: the last word in the bundle

  • freq: the frequency of the bundle

  • range: the range of the bundle (stored in rangev to avoid shadowing Python's built-in range)

%%time
import pandas as pd

w1_w2_w3 = []
w4 = []
freq = []
rangev = []
for _w123 in bundles_4.keys():
    for _w4 in bundles_4[_w123].keys():
        w1_w2_w3.append('_'.join(_w123))
        w4.append(_w4)
        freq.append(bundles_4[_w123][_w4])
        rangev.append(bundles_range[_w123][_w4])
        
CPU times: user 1.19 s, sys: 81.7 ms, total: 1.27 s
Wall time: 3.05 s
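As a design note, the same four columns can be assembled in a single nested list comprehension and passed to pandas as rows, a sketch equivalent to the loop above:

## Equivalent one-pass construction of the rows
rows = [('_'.join(w123), w4, f, bundles_range[w123][w4])
        for w123, completions in bundles_4.items()
        for w4, f in completions.items()]
## pd.DataFrame(rows, columns=['w123', 'w4', 'freq', 'range'])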

Check the lengths of the four lists before combining them into a data frame.

print(len(w1_w2_w3))
print(len(w4))
print(len(freq))
print(len(rangev))
691190
691190
691190
691190

Create the bundle data frame.

bundles_df = pd.DataFrame(list(zip(w1_w2_w3, w4, freq, rangev)),
                          columns=['w123', 'w4', 'freq', 'range'])
bundles_df.head()
w123 w4 freq range
0 ASIAN_EXPORTERS_FEAR DAMAGE 1 1
1 EXPORTERS_FEAR_DAMAGE FROM 1 1
2 FEAR_DAMAGE_FROM U 1 1
3 JAPAN_RIFT_Mounting trade 1 1
4 RIFT_Mounting_trade friction 1 1

Filter bundles whose range >= 10 and sort the data frame by the bundles’ range values in descending order.

bundles_df[(bundles_df['range']>=10)].sort_values(['range'], ascending=[False]).head(20)
w123 w4 freq range
5717 Securities_and_Exchange Commission 275 271
4813 said_in_a statement 264 260
5714 the_Securities_and Exchange 258 254
47163 3RD_QTR_NET Shr 233 233
7112 The_company_said the 230 211
46330 mln_Nine_mths Shr 203 203
7103 The_company_said it 213 197
6357 at_the_end of 250 178
51576 4TH_QTR_NET Shr 178 178
60176 with_the_Securities and 162 162
25083 cts_prior_Pay April 161 157
11887 pct_of_the total 162 156
40339 QTR_LOSS_Shr loss 142 142
24004 Inc_said_it has 141 141
26751 1ST_QTR_NET Shr 137 137
49905 QTR_JAN_31 NET 133 133
60168 a_filing_with the 130 130
21944 said_it_expects to 136 130
33141 JAN_31_NET Shr 129 129
9673 The_Bank_of England 129 126

Identify bundles whose w4 is either in or to.

bundles_df[(bundles_df['range']>=10) & (bundles_df['w4'].isin(['in','to']))].sort_values(['range'], ascending=[False]).head(20)
w123 w4 freq range
21944 said_it_expects to 136 130
33219 said_it_agreed to 113 111
42141 said_it_plans to 84 82
88616 agreed_in_principle to 75 75
45606 letter_of_intent to 72 71
85882 it_has_agreed to 48 48
60690 a_definitive_agreement to 48 48
62568 cts_a_share in 65 47
37697 dlrs_a_share in 54 45
769 who_asked_not to 41 40
1603 in_an_effort to 38 38
85883 it_has_agreed in 35 35
25651 will_be_used to 34 34
42760 5_mln_dlrs in 33 33
2215 in_the_year to 42 33
29471 transaction_is_subject to 34 32
65576 will_be_able to 33 31
60148 raised_its_stake in 31 31
33220 said_it_agreed in 32 31
54249 dlrs_per_share in 32 29

Restructure dictionary#

The commented-out code below sketches an alternative workflow on the nested dictionaries themselves: filtering out low-frequency and non-word entries, converting raw frequencies to forward transitional probabilities, and flattening the result for sorting.

# ## filter and sort

# ## remove ngrams with non-word characters
# bundles_4_2 = {(w1,w2,w3):value for (w1,w2,w3), value in bundles_4.items() if
#                re.match(r'\w+',w1) and re.match(r'\w+',w2) and re.match(r'\w+',w3)}
# print(len(bundles_4))
# print(len(bundles_4_2))

# ## remove ngrams whose freq < 5 or whose w4 contains non-word characters
# bundles_4_3 = {}
# for w1_w2_w3 in bundles_4_2:
#     bundles_4_3[w1_w2_w3] = {w4:v for w4, v in bundles_4[w1_w2_w3].items() if v >= 5 and re.match(r'\w+',w4)}

# ## clean up: drop stems left with no qualifying w4
# bundles_4_3 = {key:value for key,value in bundles_4_3.items() if len(value)!=0}

# print(list(bundles_4_3.items())[:5])
# print(len(bundles_4_3))

# ## from raw frequencies to forward transitional probabilities
# for w1_w2_w3 in bundles_4:
#     total_count = float(sum(bundles_4[w1_w2_w3].values()))
#     for w4 in bundles_4[w1_w2_w3]:
#         bundles_4[w1_w2_w3][w4] /= total_count

# ## flatten the dictionary: keys become full 4-gram tuples
# bundles_4_4 = {}
# for w1_w2_w3 in bundles_4_3:
#     for w4 in bundles_4_3[w1_w2_w3]:
#         ngram = list(w1_w2_w3)+[w4]
#         bundles_4_4[tuple(ngram)] = bundles_4_3[w1_w2_w3][w4]

# ## sort the flattened bundles by their statistic, in descending order
# sorted(bundles_4_4.items(), key=lambda x:x[1], reverse=True)
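Alternatively, the forward transitional probability P(w4 | w1 w2 w3), i.e., each bundle's frequency divided by the total frequency of its three-word stem, can be computed directly on the data frame. A minimal sketch using a groupby transform:

## Forward transitional probability of w4 given its three-word stem
bundles_df['prob'] = bundles_df['freq'] / bundles_df.groupby('w123')['freq'].transform('sum')
bundles_df.sort_values(['prob'], ascending=[False]).head()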