1 Loading packages
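The code in this tutorial relies on a handful of packages. Below is a minimal sketch of the libraries assumed throughout (the exact set in your own setup may differ):

## Loading packages (assumed set for this tutorial)
library(tidyverse)   ## dplyr, stringr, purrr, tidyr, readr, ggplot2
library(readtext)    ## readtext() for loading the .tar.gz corpus
library(tidytext)    ## unnest_tokens(), reorder_within(), scale_x_reordered()
library(jiebaR)      ## Chinese word segmentation
library(quanteda)    ## kwic() for concordance lines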
2 Objectives
In this tutorial, I would like to talk about the relationship between a construction and words. Words may co-occur with other words to form collocation patterns. When words co-occur with a particular morphosyntactic pattern, they form collostruction patterns.
Here I would like to introduce a widely applied method for research on the meanings of constructional schemas: Collostructional Analysis (Stefanowitsch & Gries, 2003). This is the major framework in corpus linguistics for the study of the relationship between words and constructions.
The idea behind collostructional analysis is simple: the meaning of a morphosyntactic construction can very often be determined by its co-occurring words.
In particular, words that are strongly associated (i.e., co-occurring) with the construction are referred to as collexemes of the construction.
Collostructional Analysis is an umbrella term, which covers several sub-analyses for constructional semantics:
- Collexeme Analysis (cf. Stefanowitsch & Gries (2003))
- Co-varying Collexeme Analysis (cf. Stefanowitsch & Gries (2005))
- (Multiple) Distinctive Collexeme Analysis (cf. Stefan T. Gries & Stefanowitsch (2004))
This tutorial will first focus on collexeme analysis, whose principles can be extended to the other analyses.
Also, I will demonstrate how we can conduct a collexeme analysis by using the R script written by Stefan Gries (Collostructional Analysis).
3 Corpus
As a demonstration, we use the Apple News Corpus (demo_data/applenews10000.tar.gz).
Our target is the Chinese construction X + 起來. The goal is to discover which words most frequently fill the X slot and what this reveals about the semantics of the construction.
That is, we want to identify the collexemes of X + 起來 and examine how coherent they are in meaning.
We begin by loading the corpus and creating a corpus object.
## Load Data
apple_df <-
  readtext("demo_data/applenews10000.tar.gz", encoding = "UTF-8") %>% ## load data
  filter(!str_detect(text, "^\\s*$")) %>%                             ## remove empty docs
  mutate(doc_id = row_number())                                       ## create index
Raw texts often contain unwanted characters such as invisible control codes, extra spaces, or duplicate line breaks.
These can interfere with segmentation accuracy. It is therefore a good idea to clean the texts before tokenization.
Here we use a simple function, normalize_document(), but you can always design a more specialized version for your own data.
- Define how we would like to preprocess the data:
## Text normalization function
normalize_document <- function(texts) {
  texts %>%
    str_replace_all("\\p{C}", " ") %>%      ## remove control characters
    str_replace_all("[\\s|\\n]+", " ") %>%  ## collapse whitespace/linebreaks into one space
    trimws()
}
- An example:
## [1] "《蘋果體育》即日起進行虛擬賭盤擂台,每名受邀參賽者進行勝負預測,每周結算在周二公布,累積勝率前3高參賽者可繼續參賽,單周勝率最高者,將加封「蘋果波神」頭銜。註:賭盤賠率如有變動,以台灣運彩為主。\n資料來源:NBA官網http://www.nba.com\n\n金塊(客) 103:92 76人騎士(主) 88:82 快艇活塞(客) 92:75 公牛勇士(客) 108:82 灰熊熱火(客) 103:82 灰狼籃網(客) 90:82 公鹿溜馬(客) 111:100 馬刺國王(客) 112:102 爵士小牛(客) 108:106 拓荒者\n\n"
## [1] "《蘋果體育》即日起進行虛擬賭盤擂台,每名受邀參賽者進行勝負預測,每周結算在周二公布,累積勝率前3高參賽者可繼續參賽,單周勝率最高者,將加封「蘋果波神」頭銜。註:賭盤賠率如有變動,以台灣運彩為主。 資料來源:NBA官網http://www.nba.com 金塊(客) 103:92 76人騎士(主) 88:82 快艇活塞(客) 92:75 公牛勇士(客) 108:82 灰熊熱火(客) 103:82 灰狼籃網(客) 90:82 公鹿溜馬(客) 111:100 馬刺國王(客) 112:102 爵士小牛(客) 108:106 拓荒者"
- Preprocess all texts in the corpus
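A sketch of this step, assuming we simply overwrite the text column with its normalized version:

## Normalize all texts in the corpus
apple_df <- apple_df %>%
  mutate(text = normalize_document(text))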
4 Word Segmentation
Next we segment the texts into words using jiebaR. The steps are:
- Initialize a jiebaR segmentation model with worker().
- Define a function word_seg_text() to tokenize the text and, if needed, add POS tags.
- Apply this function to all texts and create a new column containing the segmented strings.
This new column will serve as the basis for later construction extraction.
## Initialize jiebaR
segmenter <- worker(user = "demo_data/dict-ch-user.txt", ## user-defined dictionary
                    bylines = FALSE,                     ## do not split output by lines
                    symbol = FALSE)                      ## ignore symbols
## Define function
word_seg_text <- function(text, jiebar) {
  segment(as.character(text), jiebar) %>% ## word segmentation
    str_c(collapse = " ")                 ## concatenate word tokens into one long string
}
## Apply the function
apple_df <- apple_df %>%
  mutate(text_tag = map_chr(text, ~ word_seg_text(.x, segmenter)))
The function word_seg_text()
tokenizes Chinese text into
words and produces an enriched version of the text with explicit word
boundaries.
## [1] "蘋果 體育 即日起 進行 虛擬 賭盤 擂台 每名 受邀 參賽者 進行 勝負 預測 每周 結算 在 周二 公布 累積 勝率 前 3 高 參賽者 可 繼續 參賽 單周 勝率 最高者 將 加封 蘋果 波神 頭銜 註 賭盤 賠率 如有 變動 以 台灣 運彩 為主 資料 來源 NBA 官網 http www nba com 金塊 客 103 92 76 人 騎士 主 88 82 快艇 活塞 客 92 75 公牛 勇士 客 108 82 灰熊 熱火 客 103 82 灰狼 籃網 客 90 82 公鹿 溜 馬 客 111 100 馬刺 國王 客 112 102 爵士 小牛 客 108 106 拓荒者"
In this example we did not add POS tags in order to keep things
simple. For actual research, POS information can be very useful for
retrieving constructions more accurately. You may therefore want to
adapt both the word_seg_text()
function and the
worker()
model to include POS tagging.
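A possible sketch of such an extension, assuming jiebaR's "tag" worker, where segment() returns a named vector whose names are the POS tags (adjust the details to your own data):

## POS-tagged segmentation (sketch)
tagger <- worker(type = "tag",
                 user = "demo_data/dict-ch-user.txt",
                 symbol = FALSE)

word_seg_text_pos <- function(text, jiebar) {
  words <- segment(as.character(text), jiebar)           ## names(words) hold the POS tags
  str_c(words, names(words), sep = "/", collapse = " ")  ## e.g., "使用/v 起來/v"
}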
5 Extract Constructions
With the segmented texts, we can now extract our target construction X + 起來 using a regular expression and unnest_tokens().
Both regular expressions and corpus query languages share the idea of pattern matching. They let you describe a search pattern in a formal way and then scan through text to find all the places where that pattern occurs. In both cases, you write rules instead of manually searching, which makes them powerful for handling large amounts of data.
The key difference is that regex patterns work only on surface strings, while corpus query languages extend this logic to structured linguistic information.
- Define the constructional pattern:
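The chunk defining the pattern is not shown above; one plausible definition (a sketch, assuming the space-delimited text_tag column and the variable name used in the extraction code below) is:

## Define the constructional pattern (sketch; the original pattern may differ)
## Matches a word token followed by 起來 in the space-delimited segmented text
pattern_qilai <- "\\S+\\s起來"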
- Extract the patterns from the corpus
## Extract patterns
apple_qilai <- apple_df %>%
  select(-text) %>%           ## don't need original texts
  unnest_tokens(
    output = construction,    ## name of new unit
    input = text_tag,         ## name of old unit
    token = function(x)       ## unnesting function
      str_match_all(x, pattern = pattern_qilai)
  )
- Check the results
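For instance, a quick peek at the extracted construction tokens (a minimal sketch):

## Inspect the first few construction tokens
apple_qilai %>%
  head(10)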
6 Exploratory/Descriptive Analysis
Before the statistical/quantitative analysis, it is usually a good idea to examine the data in an exploratory way.
apple_qilai %>%
  group_by(construction) %>%
  summarize(freq = n(),
            docfreq = n_distinct(doc_id)) -> apple_qilai_dist

apple_qilai_dist %>%
  arrange(-freq) %>%
  top_n(50, freq)
7 Distributional Information Needed for Collostructional Analysis
To perform the collexeme analysis, which is essentially a statistical analysis of the association between words and a specific construction, we need to collect the necessary distributional information for the words (X) and the construction (X + 起來).
In particular, to use Stefan Gries’ R script of Collostructional Analysis, we need three types of frequency information:
- Joint frequencies of each word with the construction.
- Word frequencies in the entire corpus.
- Total corpus size (number of tokens).
An example with the word 使用:
- Joint frequency = number of times 使用 + 起來 occurs.
- Word frequency = total count of 使用 in the corpus.
- Corpus size = total number of word tokens.
7.1 Word Frequency List
We start by creating a word frequency list for the corpus.
This involves converting the text-based data frame into a word-based one and counting token frequencies.
## create word freq list
apple_word_freq <- apple_df %>%
  select(-text) %>%          ## don't need original raw texts
  unnest_tokens(             ## tokenization
    word,                    ## new unit
    text_tag,                ## old unit
    token = function(x)      ## tokenization function
      str_split(x, "\\s+")
  ) %>%
  filter(nzchar(word)) %>%   ## remove empty strings
  count(word, sort = T)

apple_word_freq %>%
  head(100)
In the example above, when we convert our data frame from a text-based to a word-based one, we do not need any sophisticated tokenization function in unnest_tokens(), because we have already obtained the enriched version of the texts, i.e., texts in which every word token is delimited by a white-space. The unnest_tokens() call here is therefore much simpler: we just split the texts into word tokens on the known delimiter, i.e., the white-spaces.
7.2 Joint Frequencies
We now calculate the joint frequencies of X and the construction X + 起來.
Since we also have the word frequency list, we can link these counts together.
## Joint frequency table
apple_qilai_freq <- apple_qilai %>%
  count(construction, sort = T) %>%      ## get joint frequencies
  tidyr::separate(col = "construction",  ## restructure data frame
                  into = c("word", "construction"),
                  sep = "\\s") %>%
  ## look up the corpus frequency of X in X + 起來
  mutate(word_freq = apple_word_freq$n[match(word, apple_word_freq$word)])

apple_qilai_freq
7.3 Input for coll.analysis.r
Now we have almost all distributional information needed for the Collostructional Analysis.
Let’s see how we can use Stefan Gries’ script, coll.analysis.r, to perform the collostructional analysis on our data set.
Stefan Gries’ coll.analysis.r
expects a
tab-delimited file with three columns:
- Word
- Word frequency in the corpus
- Joint frequency with the construction
## prepare a tsv for coll analysis
apple_qilai_freq %>%
  select(word, word_freq, n) %>%
  write_tsv("qilai.tsv")
Be sure to save the input as a TSV (tab-delimited) file, not CSV. The script will not accept comma-delimited input.
7.4 Corpus Size
In addition to the input file, Stefan Gries’ coll.analysis.r also requires a few general statistics for computing the association measures.
We prepare the necessary figures for the later collostructional analysis:
- Corpus size: the total number of word tokens in the corpus
- Construction size: the total number of construction tokens in the corpus
Later when we run Gries’ script, we need to enter these numbers manually in the terminal.
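The two figures can be computed directly from the frequency tables we built above:

## Compute the figures to be entered in the terminal
cat("Corpus Size: ", sum(apple_word_freq$n), "\n")
cat("Construction Size: ", sum(apple_qilai_freq$n), "\n")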
## Corpus Size: 2615074
## Construction Size: 545
Sometimes you may want to save console output for later use.
The function sink()
redirects printed output to a
file.
## save info in a text
sink("qilai_info.txt") ## start flushing outputs to the file not the terminal
cat("Corpus Size: ", sum(apple_word_freq$n), "\n")
cat("Construction Size: ", sum(apple_qilai_freq$n), "\n")
sink() ## end flushing
This will create a file qilai_info.txt
in your working
directory with the saved results.
8 Running Collostructional Analysis
We are now ready to run the analysis by sourcing coll.analysis.r
,
available on Stefan Gries’ website.
######################################
## Preloading Gries' Functions ##
######################################
source("https://www.stgries.info/teaching/groningen/coll.analysis.r")
## Run the interactive analysis
coll.analysis()
The script runs interactively, asking you to provide the required information.
For our example, the answers would be:
- Analysis to perform: 1
- Name of construction: QL
- Corpus size: 2615074
- Fisher-Yates: no
- Tab-delimited input data: qilai.tsv
If everything runs correctly, you will get a tab-delimited CSV output file in your working directory with the results.
9 Interpretations
The output is a tab-delimited file with statistics for each candidate collexeme.
A sample result file (2025-10-XX XX-XX-XXXX.csv) is available in the root directory.
- We load the results and check the top collexemes:
## load the results (a tab-delimited file, despite the .csv extension)
collo_table <- read_tsv("2025-10-03 08-35-47.915105.csv")

## auto-print
collo_table %>%
  filter(RELATION == "attraction") %>%
  arrange(desc(LLR)) %>%
  head(50) %>%
  select(WORD, LLR, everything())
The output includes several useful COLLSTRENGTH measures, which quantify the association between the collexemes and the X slot of the construction (X + 起來).
- LLR/PMI
- Log Odds Ratio
- Delta P
- Kullback-Leibler Divergence (KLD)
With these statistics, we can identify the strongest collexemes according to different measures.
Here we focus on the top 10 collexemes by four metrics:
- QL: joint frequency of the word and the construction
- DELTAPC2W: delta P (construction → word)
- DELTAPW2C: delta P (word → construction)
- LLR: log-likelihood ratio
## from wide to long
collo_table %>%
  filter(RELATION == "attraction") %>%
  filter(QL >= 5) %>%
  select(WORD, QL, DELTAPC2W, DELTAPW2C, LLR) %>%
  pivot_longer(cols = c("QL", "DELTAPC2W", "DELTAPW2C", "LLR"),
               names_to = "METRIC",
               values_to = "STRENGTH") %>%
  mutate(METRIC = factor(METRIC,
                         levels = c("QL", "DELTAPC2W", "DELTAPW2C", "LLR"))) %>%
  group_by(METRIC) %>%
  top_n(10, STRENGTH) %>%
  ungroup %>%
  arrange(METRIC, desc(STRENGTH)) -> coll_table_long
## plot
coll_table_long %>%
  mutate(WORD = reorder_within(WORD, within = METRIC, by = STRENGTH)) %>%
  ggplot(aes(WORD, STRENGTH, fill = METRIC)) +
  geom_col(show.legend = F) +
  scale_fill_brewer(palette = "Dark2") +
  coord_flip() +
  facet_wrap(~METRIC, scales = "free") +
  tidytext::scale_x_reordered() +
  labs(x = "Collexemes",
       y = "Strength",
       title = "Collexemes Ranked by Different Metrics")
The bar plots above show the top 10 collexemes based on four different metrics: QL, DELTAPC2W, DELTAPW2C, and LLR.
Please review the assigned readings for details on how collostructional strengths are calculated. In particular, Stefanowitsch & Gries (2003) explains why Fisher’s exact test-based measures provide advantages over raw frequency counts.
Also note that delta P is a directional association measure that has been increasingly used in psycholinguistic research (see Ellis (2006); Stefan Th Gries (2013)).
It is important to understand both how delta P is computed and how it should be interpreted.
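To make the computation concrete, here is a minimal sketch of how the two delta P values can be derived from a 2-by-2 contingency table. This is illustrative only: Gries' script performs these calculations for you, and the example frequencies below are hypothetical.

## Delta P from a 2x2 contingency table (sketch)
deltap <- function(joint, word_freq, cxn_freq, corpus_size) {
  o11 <- joint                          ## word inside the construction
  o12 <- cxn_freq - joint               ## construction without the word
  o21 <- word_freq - joint              ## word outside the construction
  o22 <- corpus_size - o11 - o12 - o21  ## everything else
  c(deltap_c2w = o11 / (o11 + o12) - o21 / (o21 + o22),  ## P(word | cxn) - P(word | other)
    deltap_w2c = o11 / (o11 + o21) - o12 / (o12 + o22))  ## P(cxn | word) - P(cxn | other)
}

## hypothetical joint and word frequencies, with the corpus/construction sizes from above
deltap(joint = 10, word_freq = 500, cxn_freq = 545, corpus_size = 2615074)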
If we look at the top 10 collexemes ranked by collostruction strength, we will see a few puzzling collexemes, such as 一, 了, and 不. Please retrieve these puzzling construction tokens as concordance lines (using quanteda::kwic()) and discuss their issues and potential solutions.
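A possible starting point for this exercise (a sketch, reusing the space-delimited text_tag column and assuming quanteda's whitespace tokenizer, what = "fastestword"):

## Concordance lines for a puzzling token, e.g. 了 + 起來
apple_tokens <- tokens(apple_df$text_tag, what = "fastestword") ## split on white-space
kwic(apple_tokens, pattern = phrase("了 起來"), window = 10)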
10 Distinctive Collexeme Analysis
We can also use Stefan Gries’ script coll.analysis.r
to
perform a Distinctive Collexeme Analysis.
To illustrate this method, let’s use data from the following study:
Huang, P. W., & Chen, A. C. H. (2022). Degree adverbs in spoken Mandarin: A behavioral profile corpus-based approach to language alternatives. Concentric: Studies in Linguistics, 48(2), 285–322.
In this work, the focus is on differences among four Mandarin degree adverbs: 很, 蠻/滿, 超, and 太.
As before, we run the interactive function coll.analysis().
For this example, the answers to the prompts would be:
- Analysis to perform: 2 (Distinctive Collexeme Analysis)
- Number of distinctive categories: 2 (or more if there are 3+ alternatives)
- Tab-delimited input data: demo_data/input-distcollexeme.tsv
For a Distinctive Collexeme Analysis, the input should be a data frame that lists each construction token in the corpus, with two key columns:
- The collexeme (the co-occurring word)
- The constructional element (the construction category being compared)
In our case, the constructional elements are the degree adverbs, and the collexemes are their co-occurring modifyees.
The following figure shows how these were annotated in concordance lines:
For Stefan’s script, we only need these two columns from the data: the collexemes and their associated constructional element.
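A hypothetical sketch of what such a two-column input could look like; the column names and rows below are made up for illustration, so check the header your version of the script expects:

## Build a two-column, tab-delimited input (hypothetical example rows)
dist_input <- tibble::tribble(
  ~COLLEXEME, ~CONSTRUCTION,
  "好吃",     "很",
  "漂亮",     "超",
  "貴",       "太"
)
write_tsv(dist_input, "input-distcollexeme-demo.tsv")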
Once the script runs, it generates the results as a tab-delimited CSV file in the root directory.
An example output file is included in this project.
### Distinctive Collexeme Analysis: Results Exploration
dist_results <- read_tsv("2025-10-02 04-10-56.334203.csv")
dist_results
In distinctive collexeme analysis, the two key output columns mean the following:
- SUMABSDEV shows how unevenly a word appears across the constructions. The larger this number, the more the word favors one construction over the others.
- LARGESTPREF tells you which construction the word favors most, i.e., the one where it appears more often than expected.
So, if a collocate has a high SUMABSDEV and LARGESTPREF = A, it means the word occurs much more often in Construction A than chance would predict.
Our next step is to explore the semantic patterns in the output:
dist_results %>%
  group_by(LARGESTPREF) %>%
  top_n(10, SUMABSDEV) %>%
  select(COLLOCATE, SUMABSDEV, LARGESTPREF) %>%
  ungroup %>%
  arrange(LARGESTPREF) %>%
  mutate(COLLOCATE = reorder_within(COLLOCATE, within = LARGESTPREF, by = SUMABSDEV)) %>%
  ggplot(aes(COLLOCATE, SUMABSDEV, fill = SUMABSDEV)) +
  geom_col(show.legend = F) +
  scale_fill_distiller(palette = "YlOrRd", direction = 1) +
  coord_flip() +
  facet_wrap(~LARGESTPREF, scales = "free") +
  tidytext::scale_x_reordered() +
  labs(x = "COLLOCATE",
       y = "STRENGTH (SUMABSDEV)",
       title = "DISTINCTIVE COLLEXEMES")