Lexical Bundles#
This section demonstrates how to identify recurring multiword sequences in texts, a topic that has received considerable attention in language studies in recent years.
Loading libraries#
from nltk.corpus import reuters
from nltk import ngrams
from collections import Counter, defaultdict
import re
Corpus Data#
In this demonstration, we use the reuters corpus as our data source, which is made available in nltk.
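If the corpus data have not been installed yet, they can be downloaded once with nltk's built-in downloader:

import nltk
nltk.download('reuters')  ## one-time download of the reuters corpus data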
## A quick look at the first five sentences
print([' '.join(sent) for sent in reuters.sents()[:5]])
["ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said .", 'They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products .', "But some exporters said that while the conflict would hurt them in the long - run , in the short - term Tokyo ' s loss might be their gain .", "The U . S . Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost .", 'Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes .']
Lexical Bundles#
Lexical bundles are contiguous multiword sequences in texts. Research on lexical bundles normally examines sequences of four to seven words.
The idea of lexical bundles is essentially that of ngrams in NLP, where N refers to the size of the multiword sequence.
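As a quick illustration with a made-up sentence (the toy sentence below is a hypothetical example, not from the corpus), nltk's ngrams() enumerates all contiguous sequences of size n:

from nltk import ngrams

toy_sent = "the results of the study suggest that".split()
print(list(ngrams(toy_sent, n=4)))
## [('the', 'results', 'of', 'the'), ('results', 'of', 'the', 'study'), ...]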
To extract a meaningful set of lexical bundles, we need to consider at least two important distributional criteria (illustrated in the sketch after this list):

- Frequency of the bundle: how often does the sequence occur in the entire corpus?
- Range of the bundle: in how many different texts/documents does the sequence occur in the entire corpus?
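To make the two criteria concrete, here is a minimal sketch with a hypothetical three-document corpus: frequency counts every occurrence of a bundle, whereas range counts each document at most once.

from collections import Counter
from nltk import ngrams

## Hypothetical toy corpus: three one-sentence "documents"
toy_docs = [
    "on the other hand the results".split(),
    "on the other hand we argue".split(),
    "on the other hand on the other hand".split(),
]

freq = Counter()   ## total occurrences across the corpus
rang = Counter()   ## number of documents containing the bundle
for doc in toy_docs:
    seen = set()
    for gram in ngrams(doc, n=4):
        freq[gram] += 1
        seen.add(gram)
    for gram in seen:
        rang[gram] += 1

print(freq[('on', 'the', 'other', 'hand')])  ## 4 (the last document contains it twice)
print(rang[('on', 'the', 'other', 'hand')])  ## 3 (it occurs in all three documents)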
## Number of documents in `reuters`
len(reuters.fileids())
10788
# Create placeholders for the 4-gram bundle statistics
bundles_4 = defaultdict(lambda: defaultdict(lambda: 0))      ## frequency counts
bundles_range = defaultdict(lambda: defaultdict(lambda: 0))  ## range counts
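The nested defaultdict spares us explicit key checks: unseen keys are created on the fly with a default count of 0. A quick demonstration (the keys below are made up):

d = defaultdict(lambda: defaultdict(lambda: 0))
d[('a', 'b', 'c')]['d'] += 1    ## no KeyError: the inner count starts at 0
print(d[('a', 'b', 'c')]['d'])  ## 1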
## All 4-grams of the second sentence
[n for n in ngrams(reuters.sents()[1], n=4)]
[('They', 'told', 'Reuter', 'correspondents'),
('told', 'Reuter', 'correspondents', 'in'),
('Reuter', 'correspondents', 'in', 'Asian'),
('correspondents', 'in', 'Asian', 'capitals'),
('in', 'Asian', 'capitals', 'a'),
('Asian', 'capitals', 'a', 'U'),
('capitals', 'a', 'U', '.'),
('a', 'U', '.', 'S'),
('U', '.', 'S', '.'),
('.', 'S', '.', 'Move'),
('S', '.', 'Move', 'against'),
('.', 'Move', 'against', 'Japan'),
('Move', 'against', 'Japan', 'might'),
('against', 'Japan', 'might', 'boost'),
('Japan', 'might', 'boost', 'protectionist'),
('might', 'boost', 'protectionist', 'sentiment'),
('boost', 'protectionist', 'sentiment', 'in'),
('protectionist', 'sentiment', 'in', 'the'),
('sentiment', 'in', 'the', 'U'),
('in', 'the', 'U', '.'),
('the', 'U', '.', 'S'),
('U', '.', 'S', '.'),
('.', 'S', '.', 'And'),
('S', '.', 'And', 'lead'),
('.', 'And', 'lead', 'to'),
('And', 'lead', 'to', 'curbs'),
('lead', 'to', 'curbs', 'on'),
('to', 'curbs', 'on', 'American'),
('curbs', 'on', 'American', 'imports'),
('on', 'American', 'imports', 'of'),
('American', 'imports', 'of', 'their'),
('imports', 'of', 'their', 'products'),
('of', 'their', 'products', '.')]
%%time
# Count frequency and range of co-occurrence
for fid in reuters.fileids():
    temp = defaultdict(lambda: defaultdict(lambda: 0))
    for sentence in reuters.sents(fileids=fid):
        for w1, w2, w3, w4 in ngrams(sentence, n=4, pad_right=False, pad_left=False):
            ## filter out ngrams containing tokens with no word characters (e.g. punctuation)
            if re.match(r'\w+', w1) and re.match(r'\w+', w2) and re.match(r'\w+', w3) and re.match(r'\w+', w4):
                bundles_4[(w1, w2, w3)][w4] += 1  ## corpus-wide frequency
                temp[(w1, w2, w3)][w4] += 1       ## per-document counts
    # range values: each document contributes at most 1 per bundle
    for key, value in temp.items():
        for k in value.keys():
            bundles_range[key][k] += 1
CPU times: user 19.1 s, sys: 1.15 s, total: 20.2 s
Wall time: 21.8 s
list(bundles_4.items())[:5]
[(('ASIAN', 'EXPORTERS', 'FEAR'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
{'DAMAGE': 1})),
(('EXPORTERS', 'FEAR', 'DAMAGE'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'FROM': 1})),
(('FEAR', 'DAMAGE', 'FROM'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'U': 1})),
(('JAPAN', 'RIFT', 'Mounting'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'trade': 1})),
(('RIFT', 'Mounting', 'trade'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
{'friction': 1}))]
list(bundles_range.items())[:5]
[(('ASIAN', 'EXPORTERS', 'FEAR'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
{'DAMAGE': 1})),
(('EXPORTERS', 'FEAR', 'DAMAGE'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'FROM': 1})),
(('FEAR', 'DAMAGE', 'FROM'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'U': 1})),
(('JAPAN', 'RIFT', 'Mounting'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>, {'trade': 1})),
(('RIFT', 'Mounting', 'trade'),
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
{'friction': 1}))]
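Both dictionaries can now be queried directly with a trigram key. For instance, the completions recorded after the trigram ('said', 'in', 'a') can be inspected as follows (the full output is long and omitted here):

## All observed fourth words following 'said in a', with their corpus frequencies
print(dict(bundles_4[('said', 'in', 'a')]))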
Convert to data frames#
For a more intuitive reading of the bundle data, we can create a data frame with the distributional information of each bundle type.
Most importantly, the data frame makes it easy to filter and sort the bundle data.
Create four lists:

- w1_w2_w3: the first three words in the bundle
- w4: the last word in the bundle
- freq: the frequency of the bundle
- rangev: the range of the bundle
%%time
import pandas as pd

w1_w2_w3 = []
w4 = []
freq = []
rangev = []

for _w123 in bundles_4.keys():
    for _w4 in bundles_4[_w123].keys():
        w1_w2_w3.append('_'.join(_w123))
        w4.append(_w4)
        freq.append(bundles_4[_w123][_w4])
        rangev.append(bundles_range[_w123][_w4])
CPU times: user 1.19 s, sys: 81.7 ms, total: 1.27 s
Wall time: 3.05 s
Check the lengths of the four lists before combining them into a data frame.
print(len(w1_w2_w3))
print(len(w4))
print(len(freq))
print(len(rangev))
691190
691190
691190
691190
Create the bundle data frame.
bundles_df = pd.DataFrame(list(zip(w1_w2_w3, w4, freq, rangev)),
                          columns=['w123', 'w4', 'freq', 'range'])
bundles_df.head()
|   | w123 | w4 | freq | range |
|---|------|----|------|-------|
| 0 | ASIAN_EXPORTERS_FEAR | DAMAGE | 1 | 1 |
| 1 | EXPORTERS_FEAR_DAMAGE | FROM | 1 | 1 |
| 2 | FEAR_DAMAGE_FROM | U | 1 | 1 |
| 3 | JAPAN_RIFT_Mounting | trade | 1 | 1 |
| 4 | RIFT_Mounting_trade | friction | 1 | 1 |
Filter bundles whose range is >= 10 and sort the data frame by the bundles' range values.
bundles_df[(bundles_df['range']>=10)].sort_values(['range'], ascending=[False]).head(20)
|   | w123 | w4 | freq | range |
|---|------|----|------|-------|
| 5717 | Securities_and_Exchange | Commission | 275 | 271 |
| 4813 | said_in_a | statement | 264 | 260 |
| 5714 | the_Securities_and | Exchange | 258 | 254 |
| 47163 | 3RD_QTR_NET | Shr | 233 | 233 |
| 7112 | The_company_said | the | 230 | 211 |
| 46330 | mln_Nine_mths | Shr | 203 | 203 |
| 7103 | The_company_said | it | 213 | 197 |
| 6357 | at_the_end | of | 250 | 178 |
| 51576 | 4TH_QTR_NET | Shr | 178 | 178 |
| 60176 | with_the_Securities | and | 162 | 162 |
| 25083 | cts_prior_Pay | April | 161 | 157 |
| 11887 | pct_of_the | total | 162 | 156 |
| 40339 | QTR_LOSS_Shr | loss | 142 | 142 |
| 24004 | Inc_said_it | has | 141 | 141 |
| 26751 | 1ST_QTR_NET | Shr | 137 | 137 |
| 49905 | QTR_JAN_31 | NET | 133 | 133 |
| 60168 | a_filing_with | the | 130 | 130 |
| 21944 | said_it_expects | to | 136 | 130 |
| 33141 | JAN_31_NET | Shr | 129 | 129 |
| 9673 | The_Bank_of | England | 129 | 126 |
Identify bundles whose w4 is either in or to.
bundles_df[(bundles_df['range']>=10) & (bundles_df['w4'].isin(['in','to']))].sort_values(['range'], ascending=[False]).head(20)
|   | w123 | w4 | freq | range |
|---|------|----|------|-------|
| 21944 | said_it_expects | to | 136 | 130 |
| 33219 | said_it_agreed | to | 113 | 111 |
| 42141 | said_it_plans | to | 84 | 82 |
| 88616 | agreed_in_principle | to | 75 | 75 |
| 45606 | letter_of_intent | to | 72 | 71 |
| 85882 | it_has_agreed | to | 48 | 48 |
| 60690 | a_definitive_agreement | to | 48 | 48 |
| 62568 | cts_a_share | in | 65 | 47 |
| 37697 | dlrs_a_share | in | 54 | 45 |
| 769 | who_asked_not | to | 41 | 40 |
| 1603 | in_an_effort | to | 38 | 38 |
| 85883 | it_has_agreed | in | 35 | 35 |
| 25651 | will_be_used | to | 34 | 34 |
| 42760 | 5_mln_dlrs | in | 33 | 33 |
| 2215 | in_the_year | to | 42 | 33 |
| 29471 | transaction_is_subject | to | 34 | 32 |
| 65576 | will_be_able | to | 33 | 31 |
| 60148 | raised_its_stake | in | 31 | 31 |
| 33220 | said_it_agreed | in | 32 | 31 |
| 54249 | dlrs_per_share | in | 32 | 29 |
Restructure dictionary#
# ## filter and sort
# ## remove ngrams with non-word characters
# bundles_4_2 = {(w1, w2, w3): value for (w1, w2, w3), value in bundles_4.items() if
#                re.match(r'\w+', w1) and re.match(r'\w+', w2) and re.match(r'\w+', w3)}
# print(len(bundles_4))
# print(len(bundles_4_2))

# ## remove ngrams whose freq < 5 and whose w4 has non-word characters
# bundles_4_3 = {}
# for w1_w2_w3 in bundles_4_2:
#     bundles_4_3[w1_w2_w3] = {w4: v for w4, v in bundles_4[w1_w2_w3].items() if v >= 5 and re.match(r'\w+', w4)}

# ## clean up dictionary: drop trigrams left with no qualifying w4
# bundles_4_3 = {key: value for key, value in bundles_4_3.items() if len(value) != 0}
# print(list(bundles_4_3.items())[:5])
# print(len(bundles_4_3))

# ## from raw frequencies to forward transitional probabilities
# for w1_w2_w3 in bundles_4:
#     total_count = float(sum(bundles_4[w1_w2_w3].values()))
#     for w4 in bundles_4[w1_w2_w3]:
#         bundles_4[w1_w2_w3][w4] /= total_count

# ## flatten the dictionary into {4-gram tuple: value}
# bundles_4_4 = {}
# for w1_w2_w3 in bundles_4_3:
#     for w4 in bundles_4_3[w1_w2_w3]:
#         ngram = list(w1_w2_w3) + [w4]
#         bundles_4_4[tuple(ngram)] = bundles_4_3[w1_w2_w3][w4]

# sorted(bundles_4_4.items(), key=lambda x: x[1], reverse=True)
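The commented-out block above also sketches how the raw counts could be turned into forward transitional probabilities, i.e. P(w4 | w1 w2 w3) = freq(w1 w2 w3 w4) / freq(w1 w2 w3). Below is a minimal standalone version of the same idea that works on a copy, so bundles_4 keeps its raw frequencies (the name bundles_4_prob is our own addition, not from the code above):

## From raw completion counts to forward transitional probabilities (on a copy)
bundles_4_prob = {}
for w123, completions in bundles_4.items():
    total = float(sum(completions.values()))
    bundles_4_prob[w123] = {w4: count / total for w4, count in completions.items()}

## e.g. the probability distribution of continuations after ('the', 'Securities', 'and')
print(bundles_4_prob[('the', 'Securities', 'and')])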