Chapter 12 Vector Space Representation

library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)

In this chapter, I would like to talk about the idea of distributional semantics, which features the hypothesis that the meaning of a linguistic unit is closely connected to its co-occurring contexts (co-texts).

I will show you how this idea can be operationalized computationally and quantified using the distributional data of the linguistic units in the corpus.

Because English and Chinese text processing requires slightly different procedures, this chapter will first focus on English texts.

12.1 Distributional Semantics

Distributional approach to semantics was first formulated by John Firth in his famous quotation:

You shall know a word by the company it keeps (Firth, 1957, p. 11).

In other words, words that occur in the same contexts tend to have similar meanings (Z. Harris, 1954).

[D]ifference of meaning correlates with difference of distribution. (Z. S. Harris, 1970, p. 785)

The meaning of a [construction] in the network is represented by how it is linked to other words and how these are interlinked themselves. (De Deyne et al., 2016)

In computational linguistics, this idea forms the basis of distributional semantics. The central assumption is simple: we can represent meaning by looking at patterns of co-occurrence in large corpora. In other words, words and documents gain meaning from the company they keep.

This idea supports two closely related applications. First, it helps us model lexical semantics, where the focus is on word meaning. Second, it helps us model document semantics, where the focus is on topics.

  • Lexical Semantics (Word Level)
    • Vector Construction: We select a target word and map its contextual features by tracking the words that frequently appear near it within a specified window. These co-occurrence frequencies form a distributional profile for that word.
    • Semantic Inference: By comparing these profiles across different words, we can quantify their semantic similarity. Words that share similar contextual environments are inferred to have similar meanings, meaning that semantic representations emerge directly from usage rather than fixed definitions.
  • Document Semantics (Topic Level)
    • Vector Construction: We shift the analytical focus from individual tokens to entire texts. Each document is represented by its lexical distribution, capturing which words appear in the text and how often.
    • Topical Inference: By comparing these word distributions across different texts, we can identify document relationships. Texts that share similar patterns of word choice are inferred to cover similar topical structures.

This shared distributional logic underpins Vector Space Semantics: capturing localized lexical meaning at the word level and broader thematic topics at the document level.

In short, we can represent the semantics of a linguistic unit based on its co-occurrence patterns with specific contextual features.

If two words co-occur with similar sets of contextual features, they are likely to be semantically similar.

With the vectorized representations of words, we can compute their semanitc distances.


Please check Baroni et al. (2014) for a critical comparison between count-based and prediction-based distributional models, and Lenci et al. (2022) for a comprehensive discussion on distribtional semantics models of different generations.

12.2 Vector Space Model for Documents

Now I would like to demonstrate how we can adopt this vector space model to examine the semantics of documents.

12.2.1 Data Processing Flowchart

In Chapter 5, I have provided a data processing flowchart for the English texts. Here I would like to add to the flowchart several follow-up steps with respect to the vector-based representation of the corpus documents.

Most importantly, a new object class is introduced in Figure 12.1, i.e., the dfm object in quanteda. It stands for Document-Feature-Matrix. It’s a two-dimensional co-occurrence table, with the rows being the documents in the corpus, and columns being the features used to characterize the documents. The cells in the matrix are the co-occurrence statistics between each document and the feature.

Different ways of operationalizing the features and the cell values may lead to different types of dfm. In this section, I would like to show you how to create dfm of a corpus and what are the common ways to define features and cell values for the analysis of document semantics via vector space representation.


Figure 12.1: English Text Analytics Flowchart (v2)


12.2.2 Document-Feature Matrix (dfm)

To create a dfm, i.e., Dcument-Feature-Matrix, of your corpus data, there are generally three steps:

  • Create an corpus object of your data;
  • Tokenize the corpus object into a tokens object;
  • Create the dfm object based on the tokens object

In this tutorial, I will use the same English dataset as we discussed in Chapter 5, the data_corpus_inaugural, preloaded in the package quanteda.

For English data, the process is simple: we first load the corpus and create a dfm object of the corpus using dfm().

## `corpus` 
corp_us <- data_corpus_inaugural
## `tokens`
corp_us_tokens <- tokens(corp_us)
## `dfm`
corp_us_dfm <- dfm(corp_us_tokens)

## check dfm
corp_us_dfm
Document-feature matrix of: 60 documents, 9,591 features (91.94% sparse) and 4 docvars.
                 features
docs              fellow-citizens  of the senate and house representatives :
  1789-Washington               1  71 116      1  48     2               2 1
  1793-Washington               0  11  13      0   2     0               0 1
  1797-Adams                    3 140 163      1 130     0               2 0
  1801-Jefferson                2 104 130      0  81     0               0 1
  1805-Jefferson                0 101 143      0  93     0               0 0
  1809-Madison                  1  69 104      0  43     0               0 0
                 features
docs              among vicissitudes
  1789-Washington     1            1
  1793-Washington     0            0
  1797-Adams          4            0
  1801-Jefferson      1            0
  1805-Jefferson      7            0
  1809-Madison        0            0
[ reached max_ndoc ... 54 more documents, reached max_nfeat ... 9,581 more features ]
## Check object class
class(data_corpus_inaugural)
[1] "corpus"    "character"
class(corp_us)
[1] "corpus"    "character"
class(corp_us_dfm)
[1] "dfm"
attr(,"package")
[1] "quanteda"

12.2.3 Intuition for DFM

What is dfm anyway?

A document-feature-matrix is a simple co-occurrence table.

In a dfm, each row refers to a document in the corpus, and the column refers to a linguistic unit that occurs in the document(s) (i.e., the contextual features in the vector space model).

  • If the contextual feature is a word, then this dfm would be a document-word-matrix, with the columns referring to all the words observed in the corpus, i.e., the vocabulary of the corpus.
  • If the contextual feature is an n-gram, then this dfm would be a document-ngram-matrix, with the columns referring to all the n-grams observed in the corpus.

What about the values in the cells of dfm?

  • The most intuitive values in the cells are the co-occurrence frequencies, i.e., the number of occurrences of the contextual feature (i.e., column) in a particular document (i.e., row).

For example, in the corpus data_corpus_inaugural, based on the corp_us_dfm created earlier, we can see that in the first document, i.e., 1789-Washington, there are 2 occurrences of “representatives”, 48 occurrences of “and”.

Bag of Words Representation

By default, the dfm() would create a document-by-word matrix, i.e., generating words as the contextual features.

A dfm with words as the contextual features is the simplest way to characterize the documents in the corpus, namely, to analyze the semantics of the documents by looking at the words occurring in the documents.

This document-by-word matrix treats each document as bags of words. That is, how the words are arranged relative to each other is ignored (i.e., the morpho-syntactic relationships between words in texts are greatly ignored). Therefore, this document-by-word dfm should be the most naive characterization of the texts.

In many computational tasks, however, it turns out that this simple bag-of-words model is very effective in modeling the semantics of the documents.

12.2.4 More sophisticated DFM

The dfm() with the default settings creates a document-by-word matrix. We can create more sophisticated DFM by refining the contextual features.

A document-by-ngram matrix can be more informative because the contextual features take into account (partial & limited) sequential information between words. In quanteda, to obtain an ngram-based dfm:

  • Create an ngram-based tokens using tokens_ngrams() first;
  • Create an ngram-based dfm based on the ngram-based tokens;
## create ngrams tokens
corp_us_ngrams <- tokens_ngrams(corp_us_tokens,
              n = 2:3)

## create ngram-based dfm
corp_us_ngrams_dfm <- dfm(corp_us_ngrams)

12.2.5 Distance/Similarity Metrics

The advantage of creating a document-feature-matrix is that now each document is not only a series of character strings, but also a list of numeric values (i.e., a row of co-occurring frequencies), which can be compared mathematically with the other documents (i.e., the other rows).

For example, now the document 1789-Washington can be represented as a series of numeric values:

## contextual features of `1789-Washington`
corp_us_dfm[1,1:20] %>%
  as.vector ## only the first 20 dimensions of the vector
 [1]   1  71 116   1  48   2   2   1   1   1   1  48   1   8   2   3  12   1   8
[20]  17

The idea is that if two documents co-occur with similar sets of contextual features, they are more likely to be similar in their semantics as well.

And now with the numeric representation of the documents, we can quantify these similarities.


Take a two-dimensional space for instance.

Let’s assume that we have three simple documents: \(x\), \(y\), \(z\), and each document is vectorized as a vector of two numeric values.

x <- c(1,9)
y <- c(1,3)
z <- c(5,1)

If we visualize these document vectors in a two-dimensional space, we can compute their distance/similarity mathematically.

In Math, there are in general two types of metrics to measure the relationship between vectors: distance-based vs. similarity-based metrics.

Vector Representation

Figure 12.2: Vector Representation

Distance-based Metrics

Many distance measures of vectors are based on the following formula and differ in the parameter \(k\).

\[\big( \sum_{i = 1}^{n}{|x_i - y_i|^k}\big)^{\frac{1}{k}}\]

The n in the above formula refers to the number of dimensions of the vectors. (In other words, all the concepts we discuss here can be easily extended to vectors in multidimensional spaces.)

When k is set to 2, it computes the famous Euclidean distance of two vectors, i.e., the direct spatial distance between two points on the n-dimensional space (Remember Pythagorean Theorem? )

\[\sqrt{\big( \sum_{i = 1}^{n}{|x_i - y_i|^2}\big)}\]

  • When \(k = 1\), the distance is referred to as Manhattan Distance or City Block Distance:

\[\sum_{i = 1}^{n}{|x_i - y_i|}\]

  • When \(k\geq3\), the distance is a generalized form, called, Minkowski Distance at the order of \(k\).

  • In other words, \(k\) represents the order of norm. The Minkowski Distance of the order 1 is the Manhattan Distance; the Minkowski Distance of the order 2 is the Euclidean Distance.

## Create vectors
x <- c(1,9)
y <- c(1,3)
z <- c(5,1)

## computing pairwise euclidean distance
sum(abs(x-y)^2)^(1/2) # XY distance
[1] 6
sum(abs(y-z)^2)^(1/2) # YZ distnace
[1] 4.472136
sum(abs(x-z)^2)^(1/2) # XZ distnace
[1] 8.944272

The geometrical meanings of the Euclidean distance are easy to conceptualize (c.f., the dashed lines in Figure 12.3)

Distance-based Metric: Euclidean Distance

Figure 12.3: Distance-based Metric: Euclidean Distance

Similarity-based Metrics

In addition to distance-based metrics, we can measure the similarity of vectors using a similarity-based metric, which often utilizes the idea of correlations. The most commonly used one is Cosine Similarity, which can be computed as follows:

\[cos(\vec{x},\vec{y}) = \frac{\sum_{i=1}^{n}{x_i\times y_i}}{\sqrt{\sum_{i=1}^{n}x_i^2}\times \sqrt{\sum_{i=1}^{n}y_i^2}}\]

# comuting pairwise cosine similarity
sum(x*y)/(sqrt(sum(x^2))*sqrt(sum(y^2))) # xy
[1] 0.9778024
sum(y*z)/(sqrt(sum(y^2))*sqrt(sum(z^2))) # yz
[1] 0.4961389
sum(x*z)/(sqrt(sum(x^2))*sqrt(sum(z^2))) # xz
[1] 0.3032037

The Cosine Similarity ranges from -1 (= the least similar) to 1 (= the most similar).

The geometric meanings of cosines of two vectors are connected to the arcs between the vectors: the greater their cosine similarity, the smaller the arcs, the closer they are.


Cosine Similarity is related to Pearson Correlation. If you would like to know more about their differences, take a look at this comprehensive blog post, Cosine Similarity VS Pearson Correlation Coefficient.

Therefore, it is clear to see that cosine similarity highlights the documents similarities in terms of whether their values on all the dimensions (contextual features) vary in the same directions.

However, distance-based metrics would highlight the document similarities in terms of how much their values on all the dimensions differ.

Computing pairwise distance/similarity using quanteda

In quanteda.textstats library, there are two main functions that can help us compute pairwise similarities/distances between vectors using two useful functions:

  • textstat_simil(): similarity-based metrics
  • textstat_dist(): distance-based metrics

The expected input argument of these two functions is a dfm and they compute all pairwise distance/similarity metrics in-between the rows of the DFM (i.e., the documents).

## Create a simple DFM
xyz_dfm <- as.dfm(matrix(c(1,9,1,3,5,1), 
                         byrow=T, 
                         ncol=2))

## Rename documents
docnames(xyz_dfm) <- c("X","Y","Z")

## Check
xyz_dfm
Document-feature matrix of: 3 documents, 2 features (0.00% sparse) and 0 docvars.
    features
docs feat1 feat2
   X     1     9
   Y     1     3
   Z     5     1
## Computing cosine similarity
textstat_simil(xyz_dfm, method="cosine")
textstat_simil object; method = "cosine"
      X     Y     Z
X 1.000 0.978 0.303
Y 0.978 1.000 0.496
Z 0.303 0.496 1.000
## Computing euclidean distance
textstat_dist(xyz_dfm, method = "euclidean")
textstat_dist object; method = "euclidean"
     X    Y    Z
X    0 6.00 8.94
Y 6.00    0 4.47
Z 8.94 4.47    0

There are other useful functions in R that can compute the pairwise distance/similarity metrics on a matrix. When using these functions, please pay attention to whether they provide distance or similarity metrics because these two are very different in meanings.

For example, amap::Dist() provides cosine-based distance, not similarity.

## Compute cosine distance with amap::Dist()
amap::Dist(xyz_dfm, method = "pearson")
           X          Y
Y 0.02219759           
Z 0.69679634 0.50386106
## Compute cosine similarity with amap::Dist()
(1- amap::Dist(xyz_dfm, method="pearson"))
          X         Y
Y 0.9778024          
Z 0.3032037 0.4961389

Interim Summary

  1. The Euclidean Distance metric is a distance-based metric: the larger the value, the more distant the two vectors.
  2. The Cosine Similarity metric is a similarity-based metric: the larger the value, the more similar the two vectors.

Based on our computations of the metrics for the three vectors, now in terms of the Euclidean Distance, y and z are closer; in terms of Cosine Similarity, x and y are closer.

Therefore, it should now be clear that the analyst needs to decide which metric to use, or more importantly, which metric is more relevant. The key is which of the following is more important in the semantic representation of the documents/words:

  • The absolute value differences that the vectors have on each dimension (i.e., the lengths of the vectors)
  • The relative increase/decrease of the values on each dimension (i.e., the curvatures of vectors)

There are many other distance-based or similarity-based metrics available. For more detail, please see Manning & Schütze (1999) Ch15.2.2. and Jurafsky & Martin (2020) Ch6: Vector Semantics and Embeddings.

12.2.6 Multidimensional Space

Back to our example of corp_us_dfm, it is essentially the same vector representation, but in a multidimensional space (cf. Figure 12.4). The document in each row is represented as a vector of N dimensional space. The size of N depends on the number of contextual features that are included in the analysis of the dfm.


Example of Document-Feature Matrix

Figure 12.4: Example of Document-Feature Matrix

12.2.7 Feature Selection

A dfm may not be as informative as we have expected. To better capture the document’s semantics/topics, there are several important factors that need to be more carefully considered with respect to the contextual features of the dfm:

  • The granularity of the features
  • The informativeness of the features
  • The distributional properties of the features

Granularity

In our previous example, we include only words, i.e., unigrams, as our contextual features in the corp_us_dfm. We can in fact include linguistic units at multiple granularities:

  • raw word forms (dfm() default contextual features)
  • (skipped) n-grams
  • lemmas/stems
  • Specific syntactic categories (e.g., lexical words)

For example, if you want to include bigrams, not unigrams, as features in the dfm, you can do the following:

  1. from corpus to tokens
  2. from tokens to ngram-based tokens
  3. from ngram-based tokens to dfm
## Create DFM based on bigrams
corp_us_dfm_bigram <- dfm(tokens_ngrams(corp_us_tokens, n =2))

Or for English data, if you want to ignore the stem variations between words (i.e., house and houses may not be differ so much), you can do it this way:

## Create DFM based on word stems
corp_us_dfm_unigram_stem <- dfm_wordstem(corp_us_dfm)

We can of course create DFM based on stemmed bigrams:

## Create DFM based on stemmed bigrams
corp_us_dfm_bigram_stem <- corp_us_tokens %>%  ## tokens
  tokens_ngrams(n = 2) %>% ## bigram tokens
  dfm() %>% ## dfm
  dfm_wordstem() ## stem

You need to decide which types of contextual features are more relevant to your research question. In many text mining applications, people often make use of both unigrams and n-grams. However, these are only heuristics, not rules.

Exercise 12.1 Based on the dataset corp_us, can you create a dfm, where the features are trigrams but all the words in the trigrams are word stems not the original surface word forms? (see below)


Informativeness

There are words that are not so informative in telling us the similarity and difference between the documents because they almost appear in every document of the corpus, but carry little (referential) semantic contents.

Typical tokens include:

  • Stopwords
  • Numbers
  • Symbols
  • Control characters
  • Punctuation marks
  • URLs

The library quanteda has defined a default English stopword list, i.e., stopwords("en").

## English stopword
stopwords("en") %>% head(50)
 [1] "i"          "me"         "my"         "myself"     "we"        
 [6] "our"        "ours"       "ourselves"  "you"        "your"      
[11] "yours"      "yourself"   "yourselves" "he"         "him"       
[16] "his"        "himself"    "she"        "her"        "hers"      
[21] "herself"    "it"         "its"        "itself"     "they"      
[26] "them"       "their"      "theirs"     "themselves" "what"      
[31] "which"      "who"        "whom"       "this"       "that"      
[36] "these"      "those"      "am"         "is"         "are"       
[41] "was"        "were"       "be"         "been"       "being"     
[46] "have"       "has"        "had"        "having"     "do"        
length(stopwords("en"))
[1] 175
  • To remove stopwords from the dfm, we can use dfm_remove() on the dfm object.

  • To remove numbers, symbols, or punctuation marks, we can further specify a few parameters for the function tokens():

    • remove_punct = TRUE: remove all characters in the Unicode “Punctuation” [P] class
    • remove_symbols = TRUE: remove all characters in the Unicode “Symbol” [S] class
    • remove_url = TRUE: find and eliminate URLs beginning with http(s)
    • remove_separators = TRUE: remove separators and separator characters (Unicode “Separator” [Z] and “Control” [C] categories)
## Create new DFM 
## w/o non-word tokens and stopwords
corp_us_dfm_unigram_stop_punct <- corp_us %>% 
  tokens(remove_punct = TRUE, ## remove non-word tokens
         remove_symbols = TRUE,
         remove_numbers = TRUE,
         remove_url = TRUE) %>%
  dfm() %>% ## Create DFM
  dfm_remove(stopwords("en")) ## Remove stopwords from DFM

We can see that the number of features varies a lot when we operationalize the contextual features differently:

## Changes of Contextual Feature Numbers
nfeat(corp_us_dfm) ## default unigram version
[1] 9591
nfeat(corp_us_dfm_unigram_stem) ## unigram + stem
[1] 5693
nfeat(corp_us_dfm_bigram) ## bigram
[1] 65616
nfeat(corp_us_dfm_bigram_stem) ## bigram + stem
[1] 59086
nfeat(corp_us_dfm_unigram_stop_punct) ## unigram removing non-words/puncs
[1] 9359

Distributional Properties

Finally, we can also select contextual features based on their distributional properties. For example:

  • (Term) Frequency: the contextual feature’s frequency counts in the corpus
    • Set a cut-off minimum to avoid hapax legomenon or highly idiosyncratic words
    • Set a cut-off maximum to remove high-frequency function words, carrying little semantic content.
  • Dispersion (Document Frequency)
    • Set a cut-off minimum to avoid idiosyncratic or domain-specific words;
    • Set a cut-off maximum to avoid stopwords;
  • Other Self-defined Weights
    • We can utilize association-based metrics to more precisely represent the association between a contextual feature and a document (e.g., PMI, LLR, TF-IDF).

In quanteda, we can easily specify our distributional cutoffs for contextual features:

  • Frequency: dfm_trim(DFM, min_termfreq = ..., max_termfreq = ...)
  • Dispersion: dfm_trim(DFM, min_docfreq = ..., max_docfreq = ...)
  • Weights: dfm_weight(DFM, scheme = ...) or dfm_tfidf(DFM)

When specifying feature selection thresholds, the arguments termfreq_type and docfreq_type determine how the respective min_* and max_* limits are interpreted:

  • "count": Interprets the cutoff values as absolute frequency counts.
  • "prop": Interprets the cutoff values as proportions, dividing the raw frequencies by the total number of tokens in the collection.
  • "rank": Matches the cutoff values against an inverted frequency ranking. A value of 1 represents the most frequent feature, 2 represents the second most frequent, and so on.
  • "quantile": Sets the cutoffs dynamically based on the mathematical quantiles of the frequency distribution.

Exercise 12.2 Please read the documentations of dfm_trim(), dfm_weight(), dfm_tfidf() very carefully and make sure you know how to use them.


In the following demo, we adopt a few strategies to refine the contextual features of the dfm:

  • we create a simple unigram dfm based on the word-forms
  • we remove stopwords, punctuation marks, numbers, and symbols
  • we include contextual words whose termfreq \(\geq\) 10 and docfreq \(\geq\) 3
## Create trimmed DFM
corp_us_dfm_trimmed <- corp_us %>%  ## corpus
  tokens( ## tokens
    remove_punct = T, 
    remove_numbers = T,
    remove_symbols = T
  ) %>%
  dfm() %>% ## dfm
  dfm_remove(stopwords("en")) %>%  ## remove stopwords
  dfm_trim(
    min_termfreq = 10,  ## frequency
    max_termfreq = NULL,
    termfreq_type = "count",
    min_docfreq = 3,  ## dispersion
    max_docfreq = NULL,
    docfreq_type = "count"
  ) 
nfeat(corp_us_dfm_trimmed)
[1] 1429

12.2.8 Exploratory Analysis of dfm

  • We can check the top features in the current corpus:
## Check top important contextual features
topfeatures(corp_us_dfm_trimmed, n = 20)
    people government         us        can       must       upon      great 
       592        575        507        489        377        371        354 
       may     states      world     nation    country      shall      every 
       343        343        329        324        323        316        309 
       one      peace        new      power        now     public 
       274        259        256        246        238        227 
  • We can visualize the top features using a word cloud:
## Create wordcloud

## Set random seed 
## (to make sure consistent results for every run)
set.seed(100)

## Color brewer
require(RColorBrewer)


## Plot wordcloud
textplot_wordcloud(
  corp_us_dfm_trimmed,
  max_words = 200,
  random_order = FALSE,
  rotation = .25,
  color =  brewer.pal(7, "Dark2")
)

12.2.9 Document Similarity

As shown in 12.4, with the N-dimensional vector representation of each document, we can compute the mathematical distances/similarities between two documents.

In Section 12.2.5, we introduced two important metrics:

  • Distance-based metric: Euclidean Distance
  • Similarity-based metric: Cosine Similarity

quanteda provides useful functions to compute these metrics (as well as other alternatives): textstat_simil() and textstat_dist()


Before computing the document similarity/distance, we usually convert the frequency counts in dfm into more sophisticated metrics using normalization or weighting schemes.

There are two common schemes:

  • Normalized Term Frequencies
  • TF-IDF Weighting Scheme

Normalized Term Frequencies

  • We can normalize these term frequencies into percentages to reduce the impact of the document size (i.e., marginal frequencies) on the significance of the co-occurrence frequencies.
## Intuition for Normalized Frequencies

## Raw Frequency Counts
corp_us_dfm_trimmed[1,1:5] 
Document-feature matrix of: 1 document, 5 features (0.00% sparse) and 4 docvars.
                 features
docs              fellow-citizens senate house representatives among
  1789-Washington               1      1     2               2     1
## Normalized Frequencies
dfm_weight(corp_us_dfm_trimmed, 
           scheme="prop")[1,1:5]
Document-feature matrix of: 1 document, 5 features (0.00% sparse) and 4 docvars.
                 features
docs              fellow-citizens      senate       house representatives
  1789-Washington     0.002421308 0.002421308 0.004842615     0.004842615
                 features
docs                    among
  1789-Washington 0.002421308
## Intuition
corp_us_dfm_trimmed[1,1:5]/sum(corp_us_dfm_trimmed[1,])
Document-feature matrix of: 1 document, 5 features (0.00% sparse) and 4 docvars.
                 features
docs              fellow-citizens      senate       house representatives
  1789-Washington     0.002421308 0.002421308 0.004842615     0.004842615
                 features
docs                    among
  1789-Washington 0.002421308

TF-IDF Weighting SCheme

  • A more sophisticated weighting scheme is TF-IDF scheme. We can weight the significance of term frequencies based on the term’s IDF.

Inverse Document Frequencies

  • For each contextual feature, we can also assess their distinctive power in terms of their dispersion across the entire corpus.
  • In addition to document frequency counts, there is a more effective metric, Inverse Document Frequency(IDF) (often used in Information Retrieval), to capture the distinctiveness of the features.
  • The IDF of a term (\(w_i\)) in a corpus (\(D\)) is defined as the log-transformed ratio of the corpus size (\(|D|\)) to the term’s document frequency (\(df_i\)).
  • The more widely dispersed the contextual feature is, the lower its IDF; the less dispersed the contextual feature, the higher its IDF.

\[ IDF(w_i, D) = log_{10}{\frac{|D|}{df_i}} \]

## Intuition for Inverse Document Frequency

## Docfreq of the first ten features
docfreq(corp_us_dfm_trimmed)[1:10]
fellow-citizens          senate           house representatives           among 
             19               9               8              14              43 
           life           event         greater           order        received 
             50               9              30              30              10 
## Inverse Doc
docfreq(corp_us_dfm_trimmed,
        scheme = "inverse",
        base = 10)[1:10]
fellow-citizens          senate           house representatives           among 
     0.49939765      0.82390874      0.87506126      0.63202321      0.14468279 
           life           event         greater           order        received 
     0.07918125      0.82390874      0.30103000      0.30103000      0.77815125 
## Intuition
corpus_size <- ndoc(corp_us_dfm_trimmed)
log(corpus_size/docfreq(corp_us_dfm_trimmed), 10)[1:10]
fellow-citizens          senate           house representatives           among 
     0.49939765      0.82390874      0.87506126      0.63202321      0.14468279 
           life           event         greater           order        received 
     0.07918125      0.82390874      0.30103000      0.30103000      0.77815125 
  • The TFIDF of a term (\(w_i\)) in a document (\(d_j\)) is the product of the word’s term frequency (TF) in the document (\(tf_{ij}\)) and the IDF of the term (\(log\frac{|N|}{df_i}\)).

\[ TFIDF(w_i, d_j) = tf_{ij} \times log\frac{|N|}{df_i} \]

## Intuition for TFIDF

## Raw Frequency Counts
corp_us_dfm_trimmed[1,1:5] 
Document-feature matrix of: 1 document, 5 features (0.00% sparse) and 4 docvars.
                 features
docs              fellow-citizens senate house representatives among
  1789-Washington               1      1     2               2     1
## TF-IDF
dfm_tfidf(corp_us_dfm_trimmed)[1,1:5]
Document-feature matrix of: 1 document, 5 features (0.00% sparse) and 4 docvars.
                 features
docs              fellow-citizens    senate    house representatives     among
  1789-Washington       0.4993976 0.8239087 1.750123        1.264046 0.1446828
## Intuition
TF_ex <- corp_us_dfm_trimmed[1,]
IDF_ex <- docfreq(corp_us_dfm_trimmed, scheme= "inverse")
TFIDF_ex <- TF_ex*IDF_ex

TFIDF_ex[1,1:5]
Document-feature matrix of: 1 document, 5 features (0.00% sparse) and 4 docvars.
                 features
docs              fellow-citizens    senate    house representatives     among
  1789-Washington       0.4993976 0.8239087 1.750123        1.264046 0.1446828

We weight the dfm with the TF-IDF scheme before the document similarity analysis.

## weight DFM
corp_us_dfm_trimmed_tfidf <- dfm_tfidf(corp_us_dfm_trimmed)

## top features of count-based DFM
topfeatures(corp_us_dfm_trimmed, 20)
    people government         us        can       must       upon      great 
       592        575        507        489        377        371        354 
       may     states      world     nation    country      shall      every 
       343        343        329        324        323        316        309 
       one      peace        new      power        now     public 
       274        259        256        246        238        227 
## top features of tfidf-based DFM
topfeatures(corp_us_dfm_trimmed_tfidf, 20)
     america        union     congress      freedom         upon constitution 
    60.06028     51.87024     41.04792     39.47051     39.34581     39.28820 
   americans         laws        today    democracy      revenue      federal 
    37.47811     35.49017     34.61845     34.45844     33.95754     33.73896 
      states     business       public    executive        thank   government 
    33.24013     32.94136     32.84299     32.76833     31.45365     30.97834 
      policy          let 
    28.67896     28.52678 
  • Distance-based Results (Euclidean Distance)
## Distance-based: Euclidean
corp_us_euclidean <- textstat_dist(corp_us_dfm_trimmed_tfidf,
                                   method = "euclidean")
  • Cosine-based Results (Cosine Similarity)
## Similarity-based: Cosine
corp_us_cosine <- textstat_simil(corp_us_dfm_trimmed_tfidf,
                                 method = "cosine")

As we have discussed in the previous sections, different distance/similarity metrics may give very different results. As an analyst, we can go back to the original documents and evaluate these quantitative results in a qualitative way.

For example, let’s manually check the document that is the most similar to 1789-Washington according to Euclidean Distance and Cosine Similarity:

## Closest document based on euclidean
which.min(corp_us_euclidean[-1,1])
1793-Washington 
              1 
## Closest document based on cosine
which.max(corp_us_cosine[-1,1])
1841-Harrison 
           13 
## You can further inspect the texts
as.character(corp_us["1789-Washington"]) %>% strtrim(400)
                                                                                                                                                                                                                                                                                                                                                                                                       1789-Washington 
"Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat wh" 
as.character(corp_us["1793-Washington"]) %>% strtrim(400)
                                                                                                                                                                                                                                                                                                                                                                                                       1793-Washington 
"Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Con" 
as.character(corp_us["1841-Harrison"]) %>% strtrim(400)
                                                                                                                                                                                                                                                                                                                                                                                                     1841-Harrison 
"Called from a retirement which I had supposed was to continue for the residue of my life to fill the chief executive office of this great and free nation, I appear before you, fellow-citizens, to take the oaths which the Constitution prescribes as a necessary qualification for the performance of its duties; and in obedience to a custom coeval with our Government and what I believe to be your expec" 

12.2.10 Cluster Analysis

The pairwise distance/similarity matrices are sometimes less comprehensive. We can visualize the document distances in a more comprehensive way by an exploratory technique called hierarchical cluster analysis.

## distance-based
corp_us_hist_euclidean <- corp_us_euclidean %>% 
  as.dist %>% ## DFM to dist
  hclust ## cluster analysis

## Plot dendrogram
# plot(corp_us_hist_euclidean, hang = -1, cex = 0.6)

require("ggdendro")
ggdendrogram(corp_us_hist_euclidean, 
             rotate = TRUE, 
             theme_dendro = TRUE)

## similarity
corp_us_hist_cosine <- (1 - corp_us_cosine) %>% ## similarity to distance
  as.dist %>% 
  hclust

## Plot dendrogram
# plot(corp_us_hist_cosine, hang = -1, cex = 0.6)
ggdendrogram(corp_us_hist_cosine, 
             rotate = TRUE, 
             theme_dendro = TRUE)

Please note that textstat_simil() gives us the similarity matrix. In other words, the numbers in the matrix indicate how similar the documents are. However, for hierarchical cluster analysis, the function hclust() expects a distance-based matrix, namely one indicating how dissimilar the documents are. Therefore, we need to use (1 - corp_us_cosine) in the cosine example before performing the cluster analysis.

Cluster anlaysis is a very useful exploratory technique to examine the emerging structure of a large dataset. For more detail introduction to this statistical method, I would recommend Gries (2013) Ch 5.6 and the very nice introductory book, Kaufman & Rousseeuw (1990).

For more information on vector-space semantics, I would highly recommend the chapter of Vector Semantics and Embeddings, by Dan Jurafsky and James Martin.

12.3 Vector Space Model for Words

So far, we have been talking about applying the vector space model to study the document semantics.

Now let’s take a look at how this distributional semantic approach can facilitate a lexical semantic analysis.

With a corpus, we can also study the distribution, or contextual features, of (target) words based on their co-occurring (contextual) words. Now I would like to introduce another object defined in quanteda, i.e., the Feature-Cooccurrence Matrix fcm.

12.3.1 Feature-Coocurrence Matrix (fcm)

A Feature-Cooccurrence Matrix is essentially a word co-occurrence matrix (i.e., a square matrix).

12.3.2 From tokens to fcm

We can create a word co-occurrence matrix fcm from the tokens object.

We can further operationalize our contextual features for target words in two ways:

  • Window-based: Only words co-occurring within a defined window size of the target word will be included as its contextual features
  • Document-based: All words co-occurring in the same document as the target word will be included as its contextual features.

Window-based Contextual Features

## tokens object
corp_us_tokens <- tokens(
  corp_us,
  what = "word",
  remove_punct = TRUE,
  remove_symbol = TRUE,
  remove_numbers = TRUE
)

## window-based FCM
corp_fcm_win_2 <- fcm(
  corp_us_tokens, ## tokens
  context = "window", ## context type
  window = 2,
  tri = FALSE) # window size

## Check
corp_fcm_win_2
Feature co-occurrence matrix of: 10,238 by 10,238 features.
                 features
features          Fellow-Citizens   of  the Senate  and House Representatives
  Fellow-Citizens               0    2    2      0    0     0               0
  of                            2  168 4918      6  858     7               4
  the                           2 4918  208      8 1498    10               1
  Senate                        0    6    8      0    4     1               0
  and                           0  858 1498      4  296     2               0
  House                         0    7   10      1    2     0               4
  Representatives               0    4    1      0    0     4               0
  Among                         0    3    2      0    0     0               1
  vicissitudes                  0    1    3      0    2     0               0
  incident                      0    2    4      0    1     0               0
                 features
features          Among vicissitudes incident
  Fellow-Citizens     0            0        0
  of                  3            1        2
  the                 2            3        4
  Senate              0            0        0
  and                 0            2        1
  House               0            0        0
  Representatives     1            0        0
  Among               0            1        0
  vicissitudes        1            0        1
  incident            0            1        0
[ reached max_nfeat ... 10,228 more features, reached max_nfeat ... 10,228 more features ]

We can check the top N contextual features (i.e., co-occurring words) for specific lexical items:

## Check top 10 contextual features
## for specific target worods
corp_fcm_win_2["our",] %>% colSums %>% sort(.,decreasing=T) %>% .[1:20]
      of      and       to       in  country      own     that   people 
     702      390      305      203      128      105       90       89 
      is      for     with       we     will     from       by    which 
      87       86       72       69       66       48       46       44 
      as      all      our citizens 
      44       44       44       43 
corp_fcm_win_2["must",] %>% colSums %>% sort(.,decreasing=T) %>% .[1:20]
     be      we      We     and     the    that     not      it      of     our 
    132      87      47      37      36      26      26      25      18      18 
      a      to   which      do America      as      It    with     for     its 
     17      16      15      13      12      11      10       9       9       9 

Document-based Contextual Features

## Document-based FCM
corp_fcm_doc <- fcm(
  corp_us_tokens,
  context = "document",
  tri = FALSE)

## check
corp_fcm_doc
Feature co-occurrence matrix of: 10,238 by 10,238 features.
                 features
features          Fellow-Citizens      of     the Senate     and House
  Fellow-Citizens               0     710    1057      2     507     2
  of                          710  685106 1821173   2042  874373  1408
  the                        1057 1821173 1216870   2749 1163095  1968
  Senate                        2    2042    2749      3    1418    12
  and                         507  874373 1163095   1418  306258   909
  House                         2    1408    1968     12     909     3
  Representatives               2     996    1350      4     503     6
  Among                         1     514     688      1     348     2
  vicissitudes                  1     701     897      1     463     2
  incident                      2    1509    2052      2    1108     2
                 features
features          Representatives Among vicissitudes incident
  Fellow-Citizens               2     1            1        2
  of                          996   514          701     1509
  the                        1350   688          897     2052
  Senate                        4     1            1        2
  and                         503   348          463     1108
  House                         6     2            2        2
  Representatives               1     2            2        2
  Among                         2     0            2        1
  vicissitudes                  2     2            0        1
  incident                      2     1            1        2
[ reached max_nfeat ... 10,228 more features, reached max_nfeat ... 10,228 more features ]
## Check top 10 contextual features
## for specific target words
corp_fcm_doc["our",] %>% colSums %>% sort(.,decreasing=T) %>% .[1:20]
   the     of    and     to     in      a   that     is     be     we    for 
407448 310075 227567 191224 109821  96123  76512  65549  64529  58036  49577 
   our   will   have     by     it    not  which   with     as 
 49080  45587  44603  44279  43827  42218  39916  38685  37890 
corp_fcm_doc["must",] %>% colSums %>% sort(.,decreasing=T) %>% .[1:20]
  the    of   and    to    in     a   our  that    is    be    we   for   not 
72497 55522 41564 34625 20279 17914 17234 13313 12849 12154 11002  9689  8079 
   it  will    by  have   are    as which 
 7795  7680  7639  7601  7067  7027  6869 

In the above examples of fcm, we tokenize our corpus into tokens first and then use it to create the fcm. The advantage of our current method is that we can have a full control of the word tokenization, i.e., what tokens we would like to include in the fcm.

This can be particularly important when we deal with Chinese data.

12.3.3 From FCM to Collocations

Once we have the co-occurrence distributions of words within a fixed window, we can move beyond simple counts to calculate lexical associations. This allows us to identify meaningful collocations —- word pairs that appear together more often than would be expected by chance.

In the following workflow, we transform a Feature Co-occurrence Matrix (FCM) into a tidy format to calculate association measures like Mutual Information (MI) and the t-score.

Important Distraction Note: Window-Based FCM vs. Bigrams

It is important to distinguish this method from our previous collocation calculations using bigrams. A bigram approach restricts candidate pairs strictly to adjacent tokens (words that sit directly next to each other, like social + media).

In contrast, using a Feature Co-occurrence Matrix (FCM) with a sliding window allows us to capture wider contextual associations. Words do not need to be immediately adjacent to form a pair; as long as they appear within the specified window span (e.g., two words before or after the target), they are counted as co-occurring. This window-based approach captures flexible, non-contiguous collocations that a strict bigram approach would miss entirely.

  • Step 1: Converting FCM to a Tidy Format

First, we convert the fcm object into a data frame where each row represents a pair of co-occurring words.

## window-based FCM
corp_fcm_win_5 <- fcm(
  corp_us_tokens, ## tokens
  context = "window", ## context type
  window = 5) # window size


## Convert fcm to a data frame
corp_fcm_win_5_df <- tidytext::tidy(corp_fcm_win_5)
  
head(corp_fcm_win_5_df)
  • Step 2: Extracting Global Word Frequencies

To determine if a pair is “special,” we need to know how often each word appears individually across the entire corpus. We use dfm() and textstat_frequency() from quanteda to generate this reference list.

## Get lexical frequencies for the whole corpus
corp_word_freq_list <- corp_us_tokens %>% 
  dfm(tolower = FALSE) %>% 
  textstat_frequency
  • Step 3: Preparing the Contingency Table Data

We now align our co-occurrence counts (\(O_{11}\)) with the marginal frequencies of the individual words (\(R_1\) and \(C_1\)). We also filter out self-pairs and rare (low-frequency) combinations to reduce statistical noise.

## Prepare numbers for collocation analysis
corp_fcm_win_5_df <- corp_fcm_win_5_df %>%
  rename(w1 = document, w2 = term, O11 = count) %>%
  filter(w1 != w2) %>%   # Remove self-pairs (e.g., "the-the")
  filter(O11 >= 5) %>%   # Remove rare pairs to ensure statistical reliability
  mutate(
    R1 = corp_word_freq_list$frequency[match(w1, corp_word_freq_list$feature)],
    C1 = corp_word_freq_list$frequency[match(w2, corp_word_freq_list$feature)]
  )

head(corp_fcm_win_5_df)
# Total number of co-occurrence tokens
TOTAL_N <- sum(corp_fcm_win_5_df$O11)
  • Step 4: Calculating Association Measures

With our observed frequencies (\(O_{11}\)) and marginal totals ready, we calculate the Expected Frequency (\(E_{11}\)) and use it to derive association scores.

  • Mutual Information (MI): Highlights strongly bonded pairs, though it can be biased toward rare words.
  • t-score: Measures the confidence that the association is not due to chance, which is often better for identifying frequent, stable collocations.
## Calculate lexical associations
corp_collocations <- corp_fcm_win_5_df %>%
  # Compute the expected frequency of the bigram
  mutate(E11 = (R1 * C1) / TOTAL_N) %>% 
  # Compute Mutual Information and t-score
  mutate(
    MI = log2(O11 / E11),
    t = (O11 - E11) / sqrt(O11)
  ) %>% 
  arrange(desc(MI)) # Sort by strongest association

## Inspect ranked collocations for a specific keyword
corp_collocations %>%
  filter(w1 == "we") %>%
  arrange(desc(MI))

12.3.4 Lexical Similarity

fcm is an interesting structure because, similar to dfm, we can now examine the pairwise relationships in-between all target words.

In particular, based on this FCM, we can further analyze which target words tend to co-occur with similar sets of contextual words.

And based on these quantitative pairwise similarities in-between target words, we can visualize their lexical relations with a network or a cluster dendrogram.

## Create window size 5 FCM
corp_fcm_win_5 <- fcm(corp_us_tokens,
                      context = "window",
                      window = 5,
                      tri = FALSE)

## Remove stopwords
corp_fcm_win_5_select <- corp_fcm_win_5 %>%
  fcm_remove(
    pattern = stopwords(),
    case_insensitive = T
  )
  • We first identify a list of top 50/100 contextual feature words
## Find top features
corp_fcm_win_5_top50 <- corp_fcm_win_5_select %>% colSums %>% sort(.,decreasing=T) %>% .[1:50]
corp_fcm_win_5_top100 <-corp_fcm_win_5_select %>% colSums %>% sort(.,decreasing=T) %>% .[1:100]

corp_fcm_win_5_top50
          us       people          can         upon         must        great 
        2467         2424         2278         1842         1816         1535 
         may        every        shall       States   Government        world 
        1508         1496         1470         1426         1350         1315 
     country          one          new        peace       nation   government 
        1275         1206         1177         1118         1099         1088 
      public        power      America          now     citizens         free 
        1062         1032          997          972          951          915 
        time     American      freedom      nations       United Constitution 
         900          890          826          816          816          811 
         war        years         make         made     national      without 
         742          727          722          691          689          684 
        good        never         life          men    President       spirit 
         652          633          631          614          600          590 
    Congress         laws          law         work       rights         just 
         579          574          573          571          563          555 
       Union         best 
         546          539 
corp_fcm_win_5_top100
          us       people          can         upon         must        great 
        2467         2424         2278         1842         1816         1535 
         may        every        shall       States   Government        world 
        1508         1496         1470         1426         1350         1315 
     country          one          new        peace       nation   government 
        1275         1206         1177         1118         1099         1088 
      public        power      America          now     citizens         free 
        1062         1032          997          972          951          915 
        time     American      freedom      nations       United Constitution 
         900          890          826          816          816          811 
         war        years         make         made     national      without 
         742          727          722          691          689          684 
        good        never         life          men    President       spirit 
         652          633          631          614          600          590 
    Congress         laws          law         work       rights         just 
         579          574          573          571          563          555 
       Union         best      liberty    political        right      justice 
         546          539          536          533          532          532 
       among         hope    interests      foreign         duty         know 
         532          531          521          511          488          485 
        well         many       common      history          let       within 
         482          479          463          459          455          453 
       human         much       fellow        first         long    Americans 
         449          443          441          439          438          433 
      policy          God     together          Let     progress        still 
         430          430          426          423          422          407 
      system       powers         ever       future        faith        whole 
         405          404          404          403          402          398 
       today          man      present      purpose   principles   confidence 
         397          390          386          384          381          378 
         way       better       always      support         less          far 
         378          377          376          375          374          373 
      duties         part        whose      service 
         371          368          364          364 
  • We subset the FCM where the target and contextual words are on the top 50 important contextual features.
  • We then create a network of these top 50 important features.
## Subset fcm for plot
fcm4network <- fcm_select(corp_fcm_win_5_select,
                       pattern = names(corp_fcm_win_5_top50), 
                       selection = "keep",
                       valuetype = "fixed")
fcm4network
Feature co-occurrence matrix of: 93 by 93 features.
          features
features   life one Country can never years every time country citizens
  life        2   1       0   5     0     4     7    0       0        1
  one         1  46       0  12     2     3     8    6       8       10
  Country     0   0       0   1     1     0     0    0       0        0
  can         5  12       1  18    17     1    11    8      13        3
  never       0   2       1  17     8     2     2    0       2        1
  years       4   3       0   1     2    18     0    1       0        1
  every       7   8       0  11     2     0    34    3      11        6
  time        0   6       0   8     0     1     3   10       5        4
  country     0   8       0  13     2     0    11    5       2        4
  citizens    1  10       0   3     1     1     6    4       4       12
[ reached max_nfeat ... 83 more features, reached max_nfeat ... 83 more features ]
## plot semantic network of top features
require(scales)
textplot_network(
  fcm4network,
  min_freq = 5, 
  vertex_labelcolor = c("grey40"),
  vertex_labelsize = 2 * scales::rescale(rowSums(fcm4network +
                                                   1), to = c(1.5, 5))
)


Exercise 12.3 In the above example, our original goal was to subset the FCM by including only the target words (rows) and contextual features (columns) that are on the list of the top 50 contextual features (names(corp_fcm_win_5_top50)). In other words, the subset FCM should have been a 50 by 50 matrix? But it’s not. Why? (Check dim(fcm4network))


  • In addition to a network, we can also create a dendrogram showing the semantic relations in-between a set of contextual features. (For example, the following shows a cluster analysis of the top 100 contextual features.)
## plot the dendrogram

## compute cosine similarity
fcm4network <- corp_fcm_win_5_select %>%
  fcm_keep(pattern = names(corp_fcm_win_5_top100))

## Check dimensions
fcm4network
Feature co-occurrence matrix of: 175 by 175 features.
         features
features  Among life present one Country whose can never years every
  Among       0    1       0   0       0     0   0     0     0     0
  life        1    2       1   1       0     1   5     0     4     7
  present     0    1       0   1       0     0   1     1     1     1
  one         0    1       1  46       0     0  12     2     3     8
  Country     0    0       0   0       0     1   1     1     0     0
  whose       0    1       0   0       1     2   3     2     2     1
  can         0    5       1  12       1     3  18    17     1    11
  never       0    0       1   2       1     2  17     8     2     2
  years       0    4       1   3       0     2   1     2    18     0
  every       0    7       1   8       0     1  11     2     0    34
[ reached max_nfeat ... 165 more features, reached max_nfeat ... 165 more features ]
## network
textplot_network(
  fcm4network,
  min_freq = 5, 
  vertex_labelcolor = c("grey40"),
  vertex_labelsize = 2 * scales::rescale(rowSums(fcm4network +
                                                   1), to = c(1.5, 5))
)

## Remove words with no co-occurrence data
fcm4cluster <- corp_fcm_win_5_select[names(corp_fcm_win_5_top100),]

## check
fcm4cluster
Feature co-occurrence matrix of: 100 by 9,996 features.
        features
features Fellow-Citizens Senate House Representatives Among vicissitudes
  us                   0      0     1               0     0            0
  people               0      0     0               0     0            0
  can                  0      0     0               0     0            0
  upon                 1      0     1               1     0            0
  must                 0      1     1               0     0            0
  great                0      0     0               0     0            0
  may                  0      0     0               0     0            0
  every                0      0     0               0     0            0
  shall                0      0     0               0     0            0
  States               1      0     0               0     0            0
        features
features incident life event filled
  us            0    5     0      0
  people        0    4     0      0
  can           0    5     2      0
  upon          0    2     0      0
  must          0    0     0      0
  great         0    3     0      0
  may           0    1     1      0
  every         1    7     0      0
  shall         0    3     1      0
  States        0    0     1      0
[ reached max_nfeat ... 90 more features, reached max_nfeat ... 9,986 more features ]
## cosine
fcm4cluster_cosine <- textstat_simil(fcm4cluster, method = "cosine")

## create hclust
fcm_hclust<- hclust(as.dist((1 - fcm4cluster_cosine)))

## plot dendrogram v1
plot(as.dendrogram(fcm_hclust, hang = 0.3),
     horiz = T,
     cex = 0.6)

## plot dendrogram v2
ggdendrogram(fcm_hclust,
             rotate = TRUE,
             theme_dendro = TRUE, cex = 0.7)

The dfm or fcm come with many potentials. Please refer to the quanteda documentation for more applications.

12.4 Semantic Representations (Supplement)

When we want a computer to understand what a word means, we cannot just give it a dictionary definition. Instead, we represent words as numbers—specifically, as lists of numbers called vectors.

In modern computational linguistics, there are two main ways to build these numerical representations: Count-Based Models and Prediction-Based Models.


12.4.1 Count-Based Models: “Judging a Word by its Neighbors”

The core idea here comes from the famous linguistic maxim: “You shall know a word by the company it keeps.” Count-based models look at a target word and literally count how many times other words appear near it within a text.

  • How it works: If you have a 5-word window, and the word coffee frequently appears near words like drink, cup, beans, and morning, the computer builds a profile for coffee based on those counts.
  • The Output: A large grid (a matrix) where rows are words and columns are the context words.
  • Pros & Cons: It is highly intuitive and straightforward to calculate, but the tables can become massive and mostly filled with zeros (sparse data) because most words do not appear next to each other.

12.4.2 Prediction-Based Models: Learning through Statistical Modeling

Instead of simply using co-occurrence frequency counts, prediction-based models treat learning word meaning as a statistical modeling. The computer is trained to predict a target word based on its local context, or to predict the context words given a single target token.

To build these representations efficiently, these models rely on dimensionality reduction or neural network architectures:

  • Latent Semantic Analysis (LSA): This technique takes a giant count-based table and compresses it using advanced dimensional reduction techniques/algorithms. It filters out the “noise” and uncovers the hidden (latent) structural relationships between words, resulting in much smaller, denser vectors.
  • Word2Vec (Neural Networks): This approach uses a neural network with a single hidden layer to learn word representations (i.e., embeddings). Instead of scanning entire sentences, the model moves a local sliding window across the text. It either uses the surrounding context words to predict the center target word, or uses the target word to predict the surrounding context. By calculating prediction errors across millions of tokens, the network continuously adjusts the weights of its layers. Over time, these weights become the dense vector representations, positioning words that share similar distribution profiles (like cat and dog) close together in the vector space.

12.4.3 Summary: The Intuitive Difference

Feature Count-Based Models Prediction-Based Models
Core Action Counting frequencies in a window Predicting missing words via training
Data Structure Large, sparse matrices (lots of zeros) Compact, dense vectors (continuous numbers)
Typical Tool Feature Co-occurrence Matrix (FCM) Word2Vec, GloVe, or LSA

Within prediction-based approaches, Word2Vec (Mikolov et al., 2013) is the most prominent framework. Instead of counting neighbors across a whole text, it uses a shallow neural network and a sliding window to learn word vectors through one of two training architectures: CBOW or Skip-gram.


Continuous Bag-of-Words (CBOW)

The CBOW model predicts a target word based on its surrounding context.

  • Mechanism: The network takes the surrounding words within a window and aggregates them to guess the missing word in the middle.
  • Analogy: A standard fill-in-the-blank test.

“I drank a cup of hot ______ this morning.”

\(\rightarrow\) Prediction: coffee

  • Strengths: Faster to train and highly effective for frequent words.

Skip-gram

The Skip-gram model does the exact opposite: it predicts the surrounding context based on a single target word.

  • Mechanism: The network takes one specific word and attempts to guess which words are likely to appear nearby.
  • Analogy: A word association game.

Target: coffee

\(\rightarrow\) Predictions: beans, brewed, morning

  • Strengths: Better at representing rare words and subtle semantic relationships, though it requires more computing power.

We will come back to the implementation in the next Chapter.


12.5 Exercises

Exercise 12.4 In this exercise, please create a dendrogram of the documents included in corp_us according to their similarities in trigram uses.

Specific steps are as follows:

  1. Please create a dfm, where the contextual features are the trigrams in the documents.
  2. After you create the dfm, please trim the dfm according to the following distributional criteria:
  • Include only trigrams consisting of \\w characters ((using dfm_select())
  • Include only trigrams whose frequencies are \(\geq\) 2 (using dfm_trim())
  • Include only trigrams whose document frequencies are \(\geq\) 2 (i.e., used in at least two different presidential addresses)
  1. Please use the cosine-based distance for cluster analysis
  • A Sub-sample of the trigram-based dfm (after the trimming according to the above distributional cut-off, the total number of trigrams in the dfm is: 7917):
  • Example of the dendrogram based on the trigram-based dfm:

Exercise 12.5 Based on the corp_us, we can study how words are connected to each other. In this exercise, please create a dendrogram of important words in the corp_us according to their similarities in their co-occurring words. Specific steps are as follows:

  1. Please create a tokens object of corpus by removing punctuations, symbols, and numbers first.
  2. Please create a window-based fcm of the corpus from the tokens object by including words within the window size of 5 as the contextual words
  3. Please remove all the stopwords included in quanteda::stopwords() from the fcm
  4. Please create a dendrogram for the top 50 important words from the resulting fcm using the cosine-based distance metrics. When clustering the top 50 features, use the co-occurrence information from the entire fcm, i.e., clustering these top 50 features according to their co-occurring words within the window size.
  • A Sub-sample of the fcm (after removing the stopwords, there are 9996 features in the fcm):
  • The dimension of the input matrix for textstats_simil() should be: 50 rows and 9996 columns.

  • Example of the dendrogram of the top 50 features in fcm:

Exercise 12.6 In this exercise, please create a dendrogram of the Taiwan Presidential Addresses included in demo_data/TW_President.tar.gz according to their similarities in ngram uses.

Specific steps are as follows:

  1. Load the corpus data and word-tokenize the texts using jiebaR to create a tokens object of the corpus
  2. Before word segmentation, normalize the texts by removing all white-spaces and line-breaks.
  3. During the word-tokenization, please remove symbols by setting worker(..., symbol = F)
  4. Please add the following terms into the self-defined dictionary of jiebaR:
c("馬英九", "英九", "蔣中正", "蔣經國", "李登輝", "登輝" ,"陳水扁", "阿扁", "蔡英文")
  1. Create the dfm of the corpus, where the contextual features are (contiguous) unigrams, bigrams and trigrams in the documents.
  2. Please trim the dfm according to the following distributional criteria:
  • Include only ngrams whose frequencies >= 5.
  • Include only ngrams whose document frequencies >= 3 (i.e., used in at least three different presidential addresses)
  1. Convert the count-based DFM into TF-IDF DFM using dfm_tfidf()
  2. Please use the cosine-based distance for cluster analysis
  • A Sub-sample of the trimmed dfm (Count-based; Number of features: 1183):
  • A Sub-sample of the trimmed dfm (TFIDF-based; Number of features: 1183):
  • Example of the dendrogram:


Exercise 12.7 Based on the Taiwan Presidential Addresses Corpus included in demo_data/TW_President.tar.gz, we can study how words are connected to each other. In this exercise, please create a dendrogram of important words in the corpus according to their similarities in their co-occurring words. Specific steps are as follows:

  1. Load the corpus data and word-tokenize the texts using jiebaR to create a tokens object of the corpus
  2. Follow the principles discussed in the previous exercise about text normalization and word segmentation.
  3. Please create a window-based fcm of the corpus from the tokens object by including words within the window size of 2 as the contextual words.
  4. Please remove all the stopwords included in demo_data/stopwords-ch.txt from the fcm
  5. After you get the fcm, create a network for the FCM’s top 50 features using textplot_network().
  6. Create a dendrogram for the top 100 important features offcm using the cosine-based distance metrics. When clustering the top 100 features, use the co-occurrence information from the entire fcm, i.e., clustering these top 100 features according to all their co-occurring words within the window size.
  • A sub-sample of the fcm (After removing stopwords, the FCM dimensions are 5234 by 5234:
  • Network of the top 50 features (Try to play with the aesthetic specifications of the network so that the node sizes reflect the total frequency of each word in the FCM):

  • The dimension of the input FCM for textstats_simil() should be: 100 rows and 5234 columns.

  • Example of the dendrogram:


References

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. Context-predicting semantic vectors. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 238–247.
De Deyne, S., Verheyen, S., & Storms, G. (2016). Structure and organization of the mental lexicon: A network approach derived from syntactic dependency relations and word associations [Book Section]. In A. Mehler, A. Lücking, S. Banisch, P. Blanchard, & B. Job (Eds.), Towards a theoretical framework for analyzing complex linguistic networks (pp. 47–79). Springer. https://doi.org/10.1007/978-3-662-47238-5_3
Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. In J. R. Firth (Ed.), Studies in linguistic analysis (pp. 1–32). Oxford: Blackwell.
Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next…. International Journal of Corpus Linguistics, 18(1), 137–166.
Harris, Z. (1954). Distributional structure. Word, 10(2/3), 146–162.
Harris, Z. S. (1970). Papers in structural and transformational linguistics [Book]. Reidel.
Jurafsky, D., & Martin, J. H. (2020). Speech & language processing 3rd. Available at https://web.stanford.edu/~jurafsky/slp3/ (Accessed on 2020/04/01).
Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley-Interscience.
Lenci, A., Sahlgren, M., Jeuniaux, P., Cuba Gyllensten, A., & Miliani, M. (2022). A comparative evaluation and analysis of three generations of distributional semantic models. Language Resources and Evaluation, 56(4), 1269–1313.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT press.