Chapter 10 Structured Corpus
library(tidyverse)
library(readtext)
library(tidytext)
library(quanteda)
Many pre-collected corpora are available for linguistic studies. Unlike self-collected text corpora, these corpora usually come in a well-defined, structured format, and the text data are often enriched with annotations. When presenting or distributing the corpus data, the corpus provider determines the format and structure in which users can access the texts.
Important factors usually include:
- Web Interface: An online interface for users to access the corpus data
- Annotation: Additional linguistic annotations at different levels
- Metadata: Additional demographic information for the corpus data
- Full-Text Availability: Whether a commercial or academic license is available for downloading the full texts of the corpus
- Full-Text Formats: The formats in which the full texts are distributed
| | Web Interface | Annotation | Metadata | Full-Text Availability | Full-Text Formats |
|---|---|---|---|---|---|
| Capacities | GUI, CQL Syntax | POS, LEMMA, WordSeg | Demographic Info, Genre | Commercial/Academic License | Texts, XML, JSON, Textgrids |
| Sinica Corpus | ✓ | ✓ | ✓ | ✓ | XML |
| COCT | ✓ | ✓ | ✓ | | |
| COCA | ✓ | ✓ | ✓ | ✓ | Raw Texts |
| BNC2014 | ✓ | ✓ | ✓ | ✓ | XML |
| CHILDES | ? | ✓ | ✓ | ✓ | CHILDES |
| ICNALE | ✓ | ✓ | | | Raw Texts |
This chapter will demonstrate how you can load existing corpora in R and perform basic corpus analysis with these data. To facilitate the sharing of corpus data, the corpus linguistics community has settled on a few common schemes for textual data storage and exchange. In particular, I would like to talk about two common types of corpus data representation: CHILDES (this chapter) and XML (Chapter 11).
10.1 NCCU Spoken Mandarin
In this demonstration, I will use the dataset of the NCCU Corpus of Taiwan Mandarin for illustration. This dataset, collected by Prof. Kawai Chui and Prof. Huei-ling Lai at National Cheng-Chi University (Chui et al., 2017; Chui & Lai, 2008), includes spontaneous face-to-face conversations in Taiwan Mandarin. The data transcription conventions can be found on the NCCU Corpus Official Website.
Generally, the NCCU corpus transcripts follow the conventions of CHILDES format. In computational text analytics, the first step is always to analyze the structure of the textual data.
10.2 CHILDES Format
The following is an excerpt from the file demo_data/data-nccu-M001.cha from the NCCU Corpus of Taiwan Mandarin. The conventions of CHILDES transcription include:
- Lines with header information begin with `@`
- Lines with utterances begin with `*`
- Indented lines are continuations of the current speaker turn
- Words are separated by white spaces (i.e., the corpus is word-segmented)
The meanings of the transcription symbols used in the corpus can be found in the documentation of the corpus.
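Before loading the full archive, it is often useful to eyeball a raw transcript to see these conventions in action. The following is a minimal sketch that simply prints the first lines of the excerpt file mentioned above (assuming it is available as a standalone file under demo_data/), so that the `@` header lines and `*` utterance lines are visible.
## Peek at the raw structure of a CHA transcript
readLines("demo_data/data-nccu-M001.cha", n = 20, encoding = "UTF-8")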
10.3 Loading the Corpus
The corpus data are available in our demo_data/corp-NCCU-SPOKEN.tar.gz, which is a compressed archive, i.e., one gzipped tar file containing all the corpus documents. We can use readtext::readtext() to load the data. By default, readtext() assumes that the archived file consists of documents in .txt format. In this step, we treat all the *.cha files as if they were normal text files (i.e., .txt) and load the entire corpus into a data frame with two columns: doc_id and text. (The warning messages only indicate that, by default, readtext() expects .txt files.)
## Load corpus
NCCU <- as_tibble(readtext("demo_data/corp-NCCU-SPOKEN.tar.gz", encoding = "UTF-8"))
10.4 From Text-based to Turn-based DF
Now the data frame NCCU is a text-based one, where each row refers to one transcript file in the corpus.
- Before further tokenization, we first need to concatenate all same-turn utterances (i.e., lines with no speaker ID at the beginning) with the initial utterance of their speaker turn;
- Then we use unnest_tokens() to transform the text-based DF into a turn-based DF.
## From text-based DF to turn-based DF
NCCU_turns <- NCCU %>%
  mutate(text = str_replace_all(text, "\n\t", " ")) %>% ## deal with same-speaker-turn utterances
  unnest_tokens(turn,                ## name for new unit
                text,                ## name for old unit
                token = function(x)  ## self-defined tokenizer
                  str_split(x, pattern = "\n"))
## Inspect first file
NCCU_turns %>%
  filter(doc_id == "M001.cha")
10.5 Metadata vs. Utterances
Lines starting with `@` are the headers of the transcript, while lines starting with `*` are the utterances of the conversation. We split our NCCU_turns into:
- NCCU_turns_meta: a DF with all header lines
- NCCU_turns_utterance: a DF with all utterance lines
## Metadata DF
NCCU_turns_meta <- NCCU_turns %>%
  filter(str_detect(turn, "^@")) ## extract all lines starting with `@`
When extracting all the utterances of the speaker turns, we perform some data preprocessing as well. Specifically, we:
- Create unique indices for every speaker turn
- Extract the speaker index of each speaker turn (SPID)
- Replace all pause tags with <PAUSE>
- Replace all extra-linguistic tags with <EXTRALING>
- Remove all overlapping-talk tags
- Remove all code-switching tags
- Remove duplicate/trailing/leading spaces
## Utterance DF
NCCU_turns_utterance <- NCCU_turns %>%
  filter(str_detect(turn, "^\\*")) %>%   ## extract all lines starting with `*`
  group_by(doc_id) %>%                   ## create unique index for turns
  mutate(turn_id = row_number()) %>%
  ungroup %>%
  tidyr::separate(col = "turn",          ## split SPID + Utterances
                  into = c("SPID", "turn"),
                  sep = ":\t") %>%
  mutate(turn2 = turn %>%                ## Clean up utterances
           str_replace_all("\\([(\\.)0-9]+?\\)", " <PAUSE> ") %>%   ## <PAUSE>
           str_replace_all("\\&\\=[a-z]+", " <EXTRALING> ") %>%     ## <EXTRALING>
           str_replace_all("[\u2308\u2309\u230a\u230b]", " ") %>%   ## overlapping talk tags
           str_replace_all("@[a-z:]+", " ") %>%                     ## code-switching tags
           str_replace_all("\\s+", " ") %>%                         ## additional white-spaces
           str_trim())
The turn-based DF, NCCU_turns_utterance, includes the utterances of each speaker turn as well as the doc_id, turn_id, and SPID. All these unique indices can help us connect each utterance back to the original conversation.
## Turns of `M001.cha`
NCCU_turns_utterance %>%
  filter(doc_id == "M001.cha") %>%
  select(doc_id, turn_id, SPID, turn, turn2)
10.6 Word-based DF and Frequency List
Because the NCCU corpus has been word-segmented, we can easily transform the turn-based DF into a word-based DF using unnest_tokens(). The key is that we need to specify our own tokenization function via token = .... The word tokenization is now simple: we split each utterance into words using white space as the delimiter.
## From turn DF to word DF
NCCU_words <- NCCU_turns_utterance %>%
  select(-turn) %>%                  ## remove raw text column
  unnest_tokens(word,                ## name for new unit
                turn2,               ## name for old unit
                token = function(x)  ## self-defined tokenizer
                  str_split(x, "\\s+"))
## Check
NCCU_words %>%
  head(100)
With the word-based DF, we can create a word frequency list of the NCCU corpus.
## Create word freq list
NCCU_words_freq <- NCCU_words %>%
  count(word, doc_id) %>%
  group_by(word) %>%
  summarize(freq = sum(n), dispersion = n()) %>%
  arrange(desc(freq), desc(dispersion))
## Check
NCCU_words_freq %>%
  head(100)
With word frequencies, we can generate a word cloud to get a quick overview of the word distribution in the NCCU corpus.
## Create Wordcloud
require(wordcloud2)
NCCU_words_freq %>%
  filter(str_detect(word, "^[^<a-z]")) %>% ## remove annotations/tags
  select(word, freq) %>%
  top_n(150, freq) %>%                     ## plot top 150 words
  mutate(freq = sqrt(freq)) %>%            ## deal with Zipfian distribution
  wordcloud2(size = 0.6)
10.7 Concordances
If we need to identify turns containing a particular linguistic unit, we can use data-wrangling tricks to easily extract the speaker turns with the target pattern. We can also make use of regular expressions to extract more complex constructions and patterns from the utterances (see the regex-based example after the two queries below).
## extracting particular patterns
NCCU_turns_utterance %>%
  filter(str_detect(turn2, "覺得"))
NCCU_turns_utterance %>%
  filter(str_detect(turn2, "這樣子"))
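For instance, the following sketch (an illustrative pattern of my own, not part of the corpus documentation) retrieves only those turns in which 覺得 is immediately preceded by a first- or second-person pronoun; because the utterances in turn2 are word-segmented and cleaned, the words are separated by a single white space.
## Regex-based concordance (illustrative pattern)
NCCU_turns_utterance %>%
  filter(str_detect(turn2, "(我|你)\\s覺得"))  ## pronoun + 覺得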
Exercise 10.1 If we are interested in the use of the verb 覺得, after we extract all the speaker turns with the verb 覺得, we may need to know the subjects that often go with the verb.
- Please identify the word before the verb for each concordance token as one independent column of the resulting data frame (see below). Please note that one speaker turn may have more than one use of 覺得.
- Please create a barplot as shown below to summarize the distribution of the top 10 frequent words that directly precede 覺得.
- Among the top 10 words, you will see “的 覺得” combinations, which are counter-intuitive. Please examine these tokens and explain why.
Alternatively, we can also create a tokens object and apply kwic() from quanteda to obtain concordance lines:
## Quanteda `tokens` object
NCCU_tokens <- NCCU_turns_utterance$turn2 %>%
  str_split("\\s+") %>% ## split into word-based list
  as.tokens             ## list2tokens
## check token numbers
sum(sapply(NCCU_tokens, length)) ## based on `tokens`
[1] 194670
nrow(NCCU_words) ## based on `unnest_tokens`
[1] 194670
## We can add docvars to `tokens`
docvars(NCCU_tokens) <- NCCU_turns_utterance[, c("doc_id", "SPID", "turn_id")] %>%
  mutate(Text = doc_id)
## KWIC
kwic(NCCU_tokens, "覺得", 8)
kwic(NCCU_tokens, "機車", 8)
10.8 Collocations (Bigrams)
Now we extend our analysis beyond single words.
Please recall the tokenizer_ngrams() function we have defined in Chapter 7.
## self-defined ngram tokenizer
tokenizer_ngrams <-
  function(texts,
           jiebar,
           n = 2,
           skip = 0,
           delimiter = "_") {
    texts %>%                ## chunks-based char vector
      segment(jiebar) %>%    ## word tokenization
      as.tokens %>%          ## list to tokens
      tokens_ngrams(n, skip, concatenator = delimiter) %>% ## ngram tokenization
      as.list                ## tokens to list
  }
The function tokenizer_ngrams() tokenizes the texts into word vectors using jiebaR, and based on these word vectors, it extracts the n-gram vectors from each text. In other words, it assumes that the input texts have NOT been word-segmented. We can create a similar n-gram tokenizer for our current NCCU corpus data:
## Self-defined ngram tokenizer
tokenizer_ngrams_v2 <-
  function(texts,
           n = 2,
           skip = 0,
           delimiter = "_") {
    texts %>%                          ## chunks-based char vector
      str_split(pattern = "\\s+") %>%  ## word tokenization
      as.tokens %>%                    ## list to tokens
      tokens_ngrams(n, skip, concatenator = delimiter) %>% ## ngram tokenization
      as.list                          ## tokens to list
  }
We use this self-defined tokenization function together with unnest_tokens() to transform the turn-based DF into a bigram-based DF.
## From turn DF to bigram DF
system.time(
  NCCU_bigrams <- NCCU_turns_utterance %>%
    select(-turn) %>%                  ## remove raw texts
    unnest_tokens(bigrams,             ## name for new unit
                  turn2,               ## name for old unit
                  token = function(x)  ## self-defined tokenizer
                    tokenizer_ngrams_v2(
                      texts = x,
                      n = 2)) %>%
    filter(bigrams != "")
)
user system elapsed
0.584 0.087 0.402
NCCU_bigrams %>%
  filter(doc_id == "M001.cha")
Please note that when we perform the n-gram tokenization, we take each speaker turn as our input. This step is important because it ensures that we do not get bigrams spanning different speaker turns, as the toy example below illustrates.
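As a quick sanity check, here is a toy sketch with two made-up word-segmented "turns"; since tokenizer_ngrams_v2() treats every element of the input vector as a separate text, no bigram can cross the boundary between the two turns.
## Toy check: bigrams never span turn boundaries (hypothetical mini example)
toy_turns <- c("我 覺得 不錯", "真的 嗎")  ## two made-up turns
tokenizer_ngrams_v2(toy_turns, n = 2)
## returns "我_覺得" "覺得_不錯" for the first turn and "真的_嗎" for the second,
## but never a cross-turn bigram such as "不錯_真的"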
To determine significant collocations in conversation, we can compute the relevant distributional statistics for each bigram type, including:
- Frequencies
- Dispersion
- Collocation Strength (Lexical Associations)
We first compute the frequencies and dispersion of all bigram types:
## Bigram Joint Frequency & Dispersion
NCCU_bigrams_freq <- NCCU_bigrams %>%
  count(bigrams, doc_id) %>%
  group_by(bigrams) %>%
  summarize(freq = sum(n), dispersion = n()) %>%
  arrange(desc(freq), desc(dispersion))
## Check
NCCU_bigrams_freq %>%
  top_n(100, freq)
## Check (para removed)
NCCU_bigrams_freq %>%
  filter(!str_detect(bigrams, "<")) %>%
  top_n(100, freq)
Exercise 10.2 In the above example, we compute the dispersion based on the number of documents where the bigram occurs. Please note that the dispersion can be defined on the basis of the speakers as well, i.e., the number of speakers who use the bigram at least once in the corpus.
How do we get dispersion statistics like this? Please show the top frequent 100 bigrams and their SPID-based dispersion statistics.
To compute the lexical associations, we need to:
- remove bigrams with para-linguistic tags
- exclude bigrams of low dispersion
- get the necessary observed frequencies (e.g., the frequencies of w1 and w2)
- get the expected frequencies (for more advanced lexical association metrics; see the formulas below)
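For reference, the two association measures computed in the code below follow the standard contingency-table notation: \(O_{11}\) is the observed joint frequency of a bigram, \(R_1\) and \(C_1\) are the corpus frequencies of its first and second word, and \(N\) is the total number of bigram tokens (approximated in the code by the sum of the observed bigram frequencies):

\[E_{11} = \frac{R_1 \times C_1}{N}\]

\[MI = \log_2\frac{O_{11}}{E_{11}}\]

\[t = \frac{O_{11} - E_{11}}{\sqrt{E_{11}}}\]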
## Computing lexical associations
NCCU_bigrams_freq %>%
  filter(!str_detect(bigrams, "<")) %>% ## remove bigrams with para tags
  filter(dispersion >= 5) %>%           ## set bigram dispersion cut-off
  rename(O11 = freq) %>%
  tidyr::separate(col = "bigrams",      ## split bigrams into two columns
                  into = c("w1", "w2"),
                  sep = "_") %>%
  ## Obtain w1 w2 freqs
  mutate(R1 = NCCU_words_freq$freq[match(w1, NCCU_words_freq$word)],
         C1 = NCCU_words_freq$freq[match(w2, NCCU_words_freq$word)]) %>%
  ## compute expected freq of bigrams
  mutate(E11 = (R1 * C1) / sum(O11)) %>%
  ## Compute lexical assoc
  mutate(MI = log2(O11 / E11),
         t = (O11 - E11) / sqrt(E11)) %>%
  mutate_if(is.double, round, 2) -> NCCU_collocations
## Check
NCCU_collocations %>%
  arrange(desc(dispersion), desc(MI)) # sorting by MI
NCCU_collocations %>%
  arrange(desc(dispersion), desc(t))  # sorting by t
Exercise 10.3 Please compute the lexical associations of the bigrams using the log-likelihood ratios.
10.9 N-grams (Lexical Bundles)
We can also extend our analysis to n-grams of larger sizes, i.e., the lexical bundles.
## Turn to 4-gram DF
system.time(
  NCCU_ngrams <- NCCU_turns_utterance %>%
    select(-turn) %>%                  ## remove raw texts
    unnest_tokens(ngram,               ## name for new unit
                  turn2,               ## name for old unit
                  token = function(x)  ## self-defined tokenizer
                    tokenizer_ngrams_v2(
                      texts = x,
                      n = 4)) %>%
    filter(ngram != "")                ## remove empty tokens (due to the short lines)
)
user system elapsed
1.014 0.153 0.599
## 4-gram Frequency List
NCCU_ngrams %>%
  count(ngram, doc_id) %>%
  group_by(ngram) %>%
  summarize(freq = sum(n), dispersion = n()) %>%
  arrange(desc(dispersion), desc(freq)) %>%
  ungroup %>%
  filter(!str_detect(ngram, "<")) -> NCCU_ngrams_freq
## Check
NCCU_ngrams_freq %>%
  filter(dispersion >= 5)
10.10 Connecting SPID to Metadata
So far, the previous analyses have not used any information from the transcript headers (metadata). In other words, the connection between the utterances and their corresponding speakers' profiles is not transparent in our current corpus analysis.
However, for sociolinguists, the headers of the transcripts can be very informative.
For example, in NCCU_turns_meta, we have more demographic information about the speakers, which allows us to further examine linguistic variation across various social factors (e.g., area, age, gender, etc.).
## Utterances of `M001.cha`
NCCU_turns_utterance %>%
  filter(doc_id == "M001.cha")
## Metadata information of speakers in `M001.cha`
NCCU_turns_meta %>%
  filter(doc_id == "M001.cha")
10.11 Processing Corpus Headers
In this section, I would like to demonstrate how to extract speaker-related information from the headers (i.e., NCCU_turns_meta) and link these speaker profiles to our corpus data (i.e., NCCU_turns_utterance).
- In the headers of each transcript, the demographic profiles of each speaker are provided in the lines starting with @id:\t;
- Each piece of information in the line is separated by a pipe sign |.
All speakers’ profiles in the corpus follow the same structure.
NCCU_turns_meta %>%
  filter(str_detect(turn, "^@(id)"))
To parse the demographic data of the speaker profiles in the turn column of NCCU_turns_meta, we can:
- Extract all lines starting with @id
- Separate the metadata string into several columns using | as the delimiter
- Select the relevant columns (speaker profiles)
- Rename the columns
- Create unique IDs for each speaker of each transcript
## Extract metadata for all speakers
NCCU_meta <- NCCU_turns_meta %>%
  filter(str_detect(turn, "^@(id)")) %>%      ## extract all ID lines
  separate(col = "turn",                      ## split SP info
           into = str_c("V", 1:11, sep = ""),
           sep = "\\|") %>%
  select(doc_id, V2, V3, V4, V5, V7, V10) %>% ## choose relevant info
  mutate(DOC_SPID = str_c(doc_id, V3, sep = "_")) %>% ## unique SPID
  rename(AGE = V4,                            ## renaming columns
         GENDER = V5,
         GROUP = V7,
         RELATION = V10,
         LANG = V2) %>%
  select(-V3) %>%                             ## remove irrelevant column
  mutate(AGE = as.integer(str_replace(AGE, ";", ""))) ## chr to integer
NCCU_meta
With the above demographic data frame of all speakers in NCCU, we can now explore the demographic distributions of the corpus.
## Gender by age distribution
NCCU_meta %>%
  ggplot(aes(GENDER, AGE, fill = GENDER)) +
  geom_boxplot() +
  theme_light()
## Relation distribution
NCCU_meta %>%
  count(RELATION) %>%
  ggplot(aes(reorder(RELATION, n), n, fill = n)) +
  geom_col() + coord_flip() +
  labs(y = "Number of Speakers",
       x = "Relation") +
  scale_fill_gradient(guide = "none") +
  theme_light()
10.12 Sociolinguistic Analyses
Now with NCCU_meta and NCCU_turns_utterance, we can connect each utterance to a particular speaker (via doc_id and SPID in NCCU_turns_utterance and DOC_SPID in NCCU_meta) and therefore study linguistic variation across speakers of different demographic backgrounds. The steps are as follows:
- We first extract the patterns we are interested in from NCCU_turns_utterance;
- We then connect the concordance tokens to their corresponding speaker profiles in NCCU_meta;
- We analyze how the patterns vary according to speakers of different profiles.
NCCU_turns_utterance %>%
  filter(doc_id == "M001.cha")
For example, we can look at bigram usage patterns by speakers of varying age groups. The analysis requires the following steps:
- We retrieve the target bigrams from NCCU_bigrams
- We generate new indices DOC_SPID for all extracted bigram tokens
- We map the DOC_SPID to NCCU_meta using left_join() to get the speaker profile of each bigram token
- We recode the speaker's age into a three-level factor (i.e., AGE_GROUP) for a more comprehensive analysis
- For each age group, we compute the bigram frequencies and dispersion (i.e., the number of speakers using the bigram)
## Analyzing bigram use by age
NCCU_bigrams_with_meta <- NCCU_bigrams %>%  ## bigram-based DF
  filter(!str_detect(bigrams, "<")) %>%     ## remove bigrams with para tags
  mutate(DOC_SPID = str_c(doc_id,           ## Create `DOC_SPID`
                          str_replace_all(SPID, "\\*", ""),
                          sep = "_")) %>%
  left_join(NCCU_meta,                      ## Link to meta DF
            by = c("DOC_SPID" = "DOC_SPID")) %>%
  mutate(AGE_GROUP = cut(                   ## relevel AGE
    AGE,
    breaks = c(0, 20, 40, 60),
    labels = c("Below_20", "AGE_20_40", "AGE_40_60")
  ))
## Check
head(NCCU_bigrams_with_meta,50)
## Bigram Frequency & Dispersion List by Age_Group
NCCU_bigrams_by_age <- NCCU_bigrams_with_meta %>%
  count(bigrams, AGE_GROUP, DOC_SPID) %>%
  group_by(bigrams, AGE_GROUP) %>%
  summarize(freq = sum(n),        ## frequency
            dispersion = n()) %>% ## dispersion (speakers)
  filter(dispersion >= 5) %>%     ## dispersion cutoff
  ungroup
## Bigram freq & dispersion by age
NCCU_bigrams_by_age
Exercise 10.4 Based on the above bigrams from each age group (whose dispersion >= 5), please identify the top 20 distinctive bigrams from each age group and visualize them in a bar plot.
The distinctive bigrams are defined as follows:
- For each bigram type, identify its observed frequencies in the respective age groups.
- Compute each bigram's expected frequencies in each age group. We know that the total numbers of bigram tokens from each age group are: 3366 (Below_20), 37981 (AGE_20_40), and 1287 (AGE_40_60). The bigram sample sizes of the age groups therefore vary a lot, so we expect each bigram's frequency to be distributed across the three age groups in proportion to each age group's bigram sample size. The expected frequencies of a bigram are calculated as follows (\(N_i\) refers to the total number of tokens of bigram \(i\)):
\[\text{Below_20_E} = N_i \times \frac{3366}{3366+37981+1287}\]
\[\text{AGE_20_40_E} = N_i \times \frac{37981}{3366+37981+1287}\]
\[\text{AGE_40_60_E} = N_i \times \frac{1287}{3366+37981+1287}\]
- Calculate each bigram's sum of normalized squared differences between the observed and the expected frequencies using the following formula (\(O_i\) and \(E_i\) refer to the bigram's observed and expected frequency in each age group):
\[\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}\]
- The above \(\chi^2\) statistic (i.e., the sum of all normalized squared differences) is known as the Chi-square statistic. Use this metric to rank the bigrams for each age group and identify the top 20 distinctive bigrams from each age group. That is, identify the top 20 bigrams according to the bigrams' Chi-square values for each age group.
- Determine each bigram's preferred age group by two criteria: (a) the normalized squared difference in a specific age group (i.e., \(\frac{(O_i - E_i)^2}{E_i}\)) indicates how much the observed frequency deviates from the expected one; (b) the raw difference between \(O_i\) and \(E_i\) indicates whether the age group is a preferred one or a repulsed one (i.e., if \(O_i\) is greater than \(E_i\) in the age group, it is likely a preferred age group).
- In other words, the preferred age group is defined as the age group (a) where \(O_i\) is greater than \(E_i\), AND (b) whose normalized squared difference is the greatest among all age groups.
- Visualize these distinctive bigrams in bar plots as shown below. In the bar plot, please show the bigram’s observed frequency in its preferred age group.
The sample data frame below shows the expected frequencies (i.e., the columns Below_20_E, AGE_20_40_E, and AGE_40_60_E), the Chi-square values (i.e., X2), and the PREFERRED_AGE for each bigram. The table is sorted by X2.
Exercise 10.5 Please create a barplot showing the top 20 distinctive trigrams of male or female speakers. Similar to the bigram example in the lecture notes, please consider only trigrams:
- that do not include paralinguistic tags;
- that have been used by at least FIVE different male or female speakers.
After removing the above irrelevant trigrams, the total numbers of trigram tokens are: 5396 for female speakers and 512 for male speakers.
Follow the same principles specified in the previous exercise and identify the top 20 distinctive trigrams of each gender ranked according to the Chi-square values. Visualize these distinctive trigrams in bar plots as shown below. In the bar plot, please show the trigram’s observed frequency in its preferred gender group.
The sample data frame below shows the expected frequencies (i.e., the columns female_E and male_E) and the Chi-square values (i.e., X2) for each trigram. The table is sorted by X2.