Chapter 13 Semantic Representations: Word Embeddings

# --- Core Text Analytics Framework (Quanteda) ---
library(quanteda)             # Main framework for managing and analyzing text
library(quanteda.textmodels)   # Statistical models for scaling and classifying text
library(quanteda.textplots)    # Specialized visualizations like word clouds and networks
library(quanteda.textstats)    # Computation of text statistics (frequency, keyness, etc.)

# --- Data Science & Machine Learning Tools ---
library(tidyverse)             # Collection of tools (dplyr, ggplot2, etc.) for data wrangling
library(word2vec)              # Training and utilizing word embedding models
library(Rtsne)                 # T-Distributed Stochastic Neighbor Embedding for visualization
library(ggrepel)               # Prevents overlapping text labels in ggplot2 charts
library(wordcloud)             # Generation of standard and comparison word clouds

We have finally reached the last chapter of our course materials. My goal for this final module is twofold. First, we will build upon the distributional semantic hypothesis by exploring how word embeddings are trained using neural network architectures. Second, we will integrate the corpus-based methods covered throughout the semester into a cohesive case study, demonstrating a comprehensive workflow for quantitative corpus analysis.

13.1 A Typical Data Analysis Workflow

Our Research Questions
- In Mandarin pop songs, what are the general emotional attitudes toward the concepts of “Man” and “Woman”? Are there differences?
- Do these emotional connections vary among different artists?
Through this simple example, we will guide you through the typical steps, workflow, and details required for a data science research project.

13.2 Data Collection

The first step in a data science project is data collection.
You can either use an existing corpus or collect your own data.
For beginners, web scraping is usually the most accessible way to collect data.
We scraped Mandarin pop lyrics and related information (artist, title, lyricist, composer, etc.) from the MOJIM lyrics website.

song_sub <- read_csv('demo_data/song_sub.csv')

13.3 Data Preprocessing

Common Text Processing

Typical text data preprocessing includes:
- Text Normalization
- Text Tokenization
- Text Enrichment/Annotation
Raw data often contains noise, so the first step is usually data wrangling: removing characters unrelated to the research topic to facilitate subsequent analysis.

In this example, we removed the following characters from the original lyrics:
- Extra whitespace and line breaks
- Punctuation and special symbols
- English characters and Arabic numerals

## --- PURPOSE: A custom cleaning pipeline to remove noise from raw Chinese lyrics ---
preprocess <- function(doc) {
  
  ## 1. Normalizing Line Breaks:
  ## Collapses multiple consecutive newlines into a single one to maintain structure
  doc <- str_replace_all(doc, "\n+", "\n")
  
  ## 2. Filtering Non-Linguistic Noise:
  ## Uses Unicode properties to remove:
  ## \p{P} - Punctuation (e.g., ！, ？, 。)
  ## \p{S} - Symbols (e.g., ￥, ★, ❤)
  ## \p{N} - Numbers (e.g., 1, 2, 3)
  ## These are replaced with spaces first; all spaces are then removed in step 4
  doc <- str_replace_all(doc, "[\\p{P}\\p{S}\\p{N}]", " ")
  
  ## 3. Filtering Cased Letters (mainly English):
  ## Removes all lowercase (\p{Ll}) and uppercase (\p{Lu}) letters; Chinese
  ## characters are uncased (\p{Lo}) and are therefore kept.
  ## Useful when you want to focus exclusively on Mandarin Chinese analysis
  doc <- str_replace_all(doc, "[\\p{Ll}\\p{Lu}]", " ")
  
  ## 4. Normalizing White Space:
  ## Removes both standard spaces and the Chinese full-width space (\u3000)
  doc <- str_replace_all(doc, "[ \u3000]+", "")
  
  ## 5. Final Formatting & Trimming:
  ## Re-collapses newlines after the cleaning steps above
  doc <- str_replace_all(doc, "\n+", "\n")
  
  ## Splits document into lines, trims whitespace from ends, and joins back together
  lines <- str_split(doc, "\n")[[1]]
  doc <- paste(str_trim(lines), collapse = "\n")
  
  return(doc)
}

## --- APPLICATION ---
## Uses purrr::map_chr to apply the 'preprocess' function to every entry in the 
## lyric column, returning a clean character vector
lyric <- map_chr(song_sub$lyric, preprocess)

print(song_sub$lyric[199]) ## original

[1] "你說的話　在我心中生了根\n愛得很深　所以心會疼\n記憶　在我的心中翻滾\n是不是每一個人　都像我一樣笨\n只怕再問　對彼此都太殘忍\n我能感覺　另外一個人\n我等　等笑容換成淚痕\n愛在崩潰的時候　比較真\n太多疑問　知道答案又如何\n原來容忍不需要天份　只要愛錯一個人\n心痛比快樂更真實　愛為何這樣的諷刺\n我忘了這是第幾次　一見你就無法堅持\n孤獨比擁抱更真實　愛讓人失去了理智\n會不會是我太自私　拒絕更寂寞的日子\n放不開　也看不見未來\n難道這種不完美　才是愛情真實的樣子\n太多疑問　知道答案又如何\n原來容忍不需要天份　只要愛錯一個人\n心痛比快樂更真實　愛為何這樣的諷刺\n我忘了這是第幾次　一見你就無法堅持\n孤獨比擁抱更真實　愛讓人失去了理智\n會不會是我太自私　拒絕更寂寞的日子\n孤獨比擁抱更真實　愛讓人失去了理智\n會不會是我太自私　拒絕更寂寞的日子\n放不開　也看不見未來\n難道這種不完美　才是愛情真實的樣子\n感謝\ntang891228\n修正歌詞"

print(lyric[199]) ## cleaned

[1] "你說的話在我心中生了根\n愛得很深所以心會疼\n記憶在我的心中翻滾\n是不是每一個人都像我一樣笨\n只怕再問對彼此都太殘忍\n我能感覺另外一個人\n我等等笑容換成淚痕\n愛在崩潰的時候比較真\n太多疑問知道答案又如何\n原來容忍不需要天份只要愛錯一個人\n心痛比快樂更真實愛為何這樣的諷刺\n我忘了這是第幾次一見你就無法堅持\n孤獨比擁抱更真實愛讓人失去了理智\n會不會是我太自私拒絕更寂寞的日子\n放不開也看不見未來\n難道這種不完美才是愛情真實的樣子\n太多疑問知道答案又如何\n原來容忍不需要天份只要愛錯一個人\n心痛比快樂更真實愛為何這樣的諷刺\n我忘了這是第幾次一見你就無法堅持\n孤獨比擁抱更真實愛讓人失去了理智\n會不會是我太自私拒絕更寂寞的日子\n孤獨比擁抱更真實愛讓人失去了理智\n會不會是我太自私拒絕更寂寞的日子\n放不開也看不見未來\n難道這種不完美才是愛情真實的樣子\n感謝\n修正歌詞"

Word Segmentation (Chinese Preprocessing)

Since Chinese text does not have explicit word boundaries, the next step after initial cleaning is usually “Word Segmentation.”
In the word-segmented version of the lyrics, we use spaces as delimiters between words.
In this implementation, we used jiebaR for word segmentation. For more detail, please check the course material, Chinese Text Processing.

print(lyric[199]) ## cleaned

[1] "你說的話在我心中生了根\n愛得很深所以心會疼\n記憶在我的心中翻滾\n是不是每一個人都像我一樣笨\n只怕再問對彼此都太殘忍\n我能感覺另外一個人\n我等等笑容換成淚痕\n愛在崩潰的時候比較真\n太多疑問知道答案又如何\n原來容忍不需要天份只要愛錯一個人\n心痛比快樂更真實愛為何這樣的諷刺\n我忘了這是第幾次一見你就無法堅持\n孤獨比擁抱更真實愛讓人失去了理智\n會不會是我太自私拒絕更寂寞的日子\n放不開也看不見未來\n難道這種不完美才是愛情真實的樣子\n太多疑問知道答案又如何\n原來容忍不需要天份只要愛錯一個人\n心痛比快樂更真實愛為何這樣的諷刺\n我忘了這是第幾次一見你就無法堅持\n孤獨比擁抱更真實愛讓人失去了理智\n會不會是我太自私拒絕更寂寞的日子\n孤獨比擁抱更真實愛讓人失去了理智\n會不會是我太自私拒絕更寂寞的日子\n放不開也看不見未來\n難道這種不完美才是愛情真實的樣子\n感謝\n修正歌詞"

print(song_sub$lyric_wordseg[199]) ## word-segmented version of song 199

[1] "你 說 的 話 在 我 心 中 生 了 根 \n 愛 得 很 深 所以 心 會 疼 \n 記憶 在 我 的 心 中 翻滾 \n 是 不 是 每 一 個 人 都 像 我 一樣 笨 \n 只 怕 再 問 對 彼此 都 太 殘忍 \n 我 能 感覺 另外 一 個 人 \n 我 等等 笑容 換成 淚痕 \n 愛 在 崩潰 的 時候 比較 真 \n 太多 疑問 知道 答案 又 如何 \n 原來 容忍 不 需要 天份 只要 愛錯 一 個 人 \n 心痛 比 快樂 更 真實 愛 為何 這樣 的 諷刺 \n 我 忘 了 這 是 第幾 次 一 見 你 就 無法 堅持 \n 孤獨 比 擁抱 更 真實 愛 讓 人 失去 了 理智 \n 會不會 是 我 太 自私 拒絕 更 寂寞 的 日子 \n 放 不 開 也 看 不 見 未來 \n 難道 這 種 不 完美 才 是 愛情 真實 的 樣子 \n 太多 疑問 知道 答案 又 如何 \n 原來 容忍 不 需要 天份 只要 愛錯 一 個 人 \n 心痛 比 快樂 更 真實 愛 為何 這樣 的 諷刺 \n 我 忘 了 這 是 第幾 次 一 見 你 就 無法 堅持 \n 孤獨 比 擁抱 更 真實 愛 讓 人 失去 了 理智 \n 會不會 是 我 太 自私 拒絕 更 寂寞 的 日子 \n 孤獨 比 擁抱 更 真實 愛 讓 人 失去 了 理智 \n 會不會 是 我 太 自私 拒絕 更 寂寞 的 日子 \n 放 不 開 也 看 不 見 未來 \n 難道 這 種 不 完美 才 是 愛情 真實 的 樣子 \n 感謝 \n 修正 歌詞"

13.4 Corpus Data Profile

This dataset contains lyrics from 17 Mandarin male and female singers, totaling 2409 songs.
The selection of these singers was not systematic (it perhaps reflects my age). For greater research rigor, all male and female singers should ideally be included, or a better sampling heuristic should be provided.

The distribution of songs collected per artist is as follows:

song_sub %>%
  group_by(artist) %>%
  summarize(N=n()) %>%
  ggplot(aes(reorder(artist, N), N)) +
  geom_bar(stat = "identity", aes(fill = N), color = "white") +
  geom_text(aes(label = N), nudge_y = 10) +
  theme(axis.text = element_text(angle = 90)) +
  labs(x="Artist", y = "Number of Songs") +
  scale_fill_distiller(palette = "RdYlBu") +
  coord_flip() +
  theme(axis.text.y = element_text(angle=0))

13.5 How do we define the sentiment of “Man” and “Woman” concepts?

To determine which concepts are associated with 男人 (Man) and 女人 (Woman), we can compare two methods:

The Classic Approach: Collocations

Collocations identify words that habitually appear together more frequently than chance. This method focuses on the “company words keep” within a specific text.

Mechanism: Uses statistical measures to find words that physically appear near the target within a fixed window.
Intuition: It captures explicit, surface-level associations, such as specific adjectives an artist frequently uses to describe a gender.
Analysis: We extract these highly linked words and analyze their emotional orientation using sentiment lexicons to see if certain genders are consistently paired with specific moods.

You can utilize the fcm() function introduced in Chapter 12 to extract words co-occurring with MEN and WOMEN within a fixed window size. By calculating the lexical associations of these co-occurrence patterns, you can determine a distinct set of significant collocates for MEN and WOMEN respectively. Ideological interpretations can then be made based on these contrasting collocates.
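
The window-based counting behind this approach can be sketched in base R. This is a minimal illustration rather than the fcm() implementation: the toy documents (English glosses standing in for segmented Chinese tokens) and the helper name `window_collocates()` are assumptions made for this example, and the sketch covers only the counting step, not the association statistics.

```r
## Toy, already-segmented documents (hypothetical, not from the lyrics corpus)
toy_docs <- list(
  c("man", "not", "cry", "man", "strong"),
  c("woman", "gentle", "woman", "cry"),
  c("man", "strong", "woman", "gentle")
)

## Count every token that falls within +/- `window` positions of `target`
## (hypothetical helper, defined here for illustration only)
window_collocates <- function(docs, target, window = 2) {
  hits <- unlist(lapply(docs, function(tok) {
    pos <- which(tok == target)
    unlist(lapply(pos, function(p) {
      idx <- setdiff(seq(max(1, p - window), min(length(tok), p + window)), p)
      tok[idx]
    }))
  }))
  sort(table(hits), decreasing = TRUE)
}

man_collocates   <- window_collocates(toy_docs, "man", window = 1)
woman_collocates <- window_collocates(toy_docs, "woman", window = 1)

print(man_collocates)
print(woman_collocates)
```

On real data, raw counts like these would still need an association measure before two collocate sets can be compared fairly, because frequent words co-occur with almost everything.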

The Deep Learning Approach: Word Embeddings

Word embeddings map words into a high-dimensional vector space where geometric closeness corresponds to semantic similarity, even if the target words never co-occur in the same sentence. For ideological analysis, this mapping of words into a multi-dimensional space is valuable because it captures systemic cultural associations and subtle framing biases rather than just explicit declarations.

Mechanism: The model learns latent relationships based on the Distributional Hypothesis, which states that words occurring in similar contexts share similar meanings.
Intuition: It captures conceptual neighbors and semantic clusters, revealing deeper, implicit societal associations and latent ideologies that simple keyword frequencies might miss.
Application for Ideological Analysis: By projecting abstract concepts into a shared vector space, we can mathematically calculate the distance between target categories (such as gender vectors like MEN and WOMEN) and specific evaluative attributes (such as domesticity, career, or emotion clusters). This allows researchers to quantify the systemic, non-obvious biases embedded in a text or an entire corpus, providing empirical evidence of how specific groups or ideas are culturally framed.
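
The distance calculation described above can be made concrete with a small base R sketch. Everything in it is assumed for illustration: the 3-dimensional vectors are invented, and the helpers `mean_assoc()` and `assoc_diff()` are hypothetical names, not functions from any package. The score compares how close a target word sits to one attribute set versus another.

```r
## Toy 3-dimensional embeddings (hypothetical values chosen for illustration)
emb <- rbind(
  man    = c(0.9, 0.1, 0.2),
  woman  = c(0.1, 0.9, 0.2),
  career = c(0.8, 0.2, 0.1),
  salary = c(0.7, 0.1, 0.3),
  home   = c(0.2, 0.8, 0.1),
  family = c(0.1, 0.7, 0.3)
)

## Cosine similarity between two numeric vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

## Mean cosine similarity between one target word and a set of attribute words
mean_assoc <- function(target, attributes, m) {
  mean(vapply(attributes, function(w) cosine(m[target, ], m[w, ]), numeric(1)))
}

## Positive score: the target leans toward set_a; negative: toward set_b
assoc_diff <- function(target, set_a, set_b, m) {
  mean_assoc(target, set_a, m) - mean_assoc(target, set_b, m)
}

career_words <- c("career", "salary")
home_words   <- c("home", "family")

score_man   <- assoc_diff("man", career_words, home_words, emb)
score_woman <- assoc_diff("woman", career_words, home_words, emb)
print(c(man = score_man, woman = score_woman))
```

With real embeddings, the attribute sets would come from a lexicon or from the researcher's theory, and the sign and size of the score would be an empirical finding rather than something built into the toy data.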

Advantages of Word Embeddings

Continuous Vector Representations: Word embeddings represent words as dense, low-dimensional vectors of real numbers (e.g., 300 dimensions), replacing sparse methods like one-hot encoding. This numerical format allows computers to perform mathematical operations on text.
The Distributional Hypothesis: The underlying linguistic assumption is that words occurring in similar contexts share similar semantic properties. Embedding algorithms derive meaning by analyzing these co-occurrence patterns across large corpora.
Representation Learning: These numerical vectors are learned automatically. Through predictive training architectures (like Word2Vec), a network continuously minimizes prediction errors, adjusting coordinates so that distributionally similar words (like cat and dog) are positioned close together in the vector space.
Foundation for Language Models: These dense vectors serve as the foundational inputs for modern natural language processing, deep learning architectures, and Large Language Models (LLMs), allowing them to capture subtle syntactic and semantic relationships.

Hugging Face operates as a central hub for the open-source artificial intelligence community, hosting thousands of pre-trained embedding models that are publicly accessible for academic and linguistic research. Through the Hugging Face Hub, researchers can download state-of-the-art models for text, speech, and multimodal representations across hundreds of languages. These resources allow you to leverage massive, deeply trained architectures—such as BERT, RoBERTa, and Chinese language models developed by CKIP Lab—without the need for immense local computational power, making advanced representation learning highly accessible for quantitative text analysis.

13.6 A Step-by-Step Training of Word Embeddings

This practical guide walks you through the entire pipeline of training and analyzing predictive word embeddings in R using the word2vec package.

We will move step-by-step from raw token preparation and hyperparameter configuration to model training. Once the embeddings are learned, we can extract semantic neighbors of the seed words (e.g., 男人, 女人) and map the high-dimensional vectors onto a 2D space using t-SNE and PCA.

a. Prepare the Tokens

For pre-segmented text such as ours, we pass the corpus to word2vec() as a list of word-tokenized character vectors, where each document has been split into independent word units. A convenient way to prepare your text for this pipeline is to first generate a standard tokens object of your corpus.

head(song_sub)

## 1. Convert Segmented Text to Tokens:
## Splits the space-separated character strings into individual word units (tokens)
song_sub_tokens <- str_split(song_sub$lyric_wordseg, "\\s+")

## 2. Stopword Preparation:
library(tmcn)
## Retrieve a standard Chinese stopword list and ensure it is in Traditional Chinese
## to match the script used in Mandarin Pop lyrics
stopwords_zh <- toTrad(stopwords("zh", source = "misc"))

## 3. Custom Filtering:
## Add domain-specific noise words commonly found in lyric metadata
stopwords_zh <- c(stopwords_zh, "感謝", "修正", "提供", "歌詞")

## 4. Token Cleaning Pipeline:
song_sub_tokens %>%
  ## Convert the raw list into a quanteda 'tokens' object for efficient processing
  as.tokens() %>%
  ## Remove all functional words and custom noise defined in our stopword list
  tokens_remove(stopwords_zh) %>%
  ## Convert the cleaned tokens back into a standard R list for downstream analysis
  as.list() -> song_sub_tokens

b. Training the Model

To train the embeddings model, we use the word2vec() function.

This function allows you to configure your model by selecting the training architecture—either Skip-gram or CBOW—and adjusting key hyperparameters like vector dimensionality, window size, and iteration count.

## Set a seed so that the stochastic (random) weight initialization is
## reproducible; with threads > 1, results may still vary slightly between runs
set.seed(123)

## Model Training Configuration:
model <- word2vec(
  x = song_sub_tokens,         ## Input: A list of character vectors (cleaned tokens)
  
  ## Architecture: Skip-gram focuses on using a target word to predict context, 
  ## which typically handles rare words and semantic nuances better than CBOW
  type = "skip-gram",          
  
  ## Embedding Dimensionality: Maps each word into a 128-dimensional coordinate space
  dim = 128,                   
  
  ## Context Window: The model looks at 5 words before and 5 words after the target
  window = 5,                  
  
  ## Training Iterations: The algorithm makes 20 complete passes over the dataset
  iter = 20,                   
  
  ## Frequency Threshold: Words appearing fewer than 5 times in the corpus are 
  ## excluded to ensure the model only learns from sufficient evidence
  min_count = 5,               
  
  ## Parallel Processing: Distributes the computation across 4 CPU cores to reduce training time
  threads = 4                  
)

## Export the trained model to a binary file so it can be reloaded 
## later without re-running the training process
write.word2vec(model, "embeddings/songs_tokens_embeddings.bin")

Within the Word2Vec framework, there are two primary architectures used to learn word embeddings. Both rely on a “sliding window” approach to analyze context, but they differ in their prediction direction.

CBOW (Continuous Bag-of-Words)

Predicting the target from the context
Logic: The model takes multiple surrounding “context words” as input and tries to predict the single “target word” in the middle.
Performance:
- Faster to train than Skip-gram.
- Works well with frequent words.
- Smooths out distributional information by treating the context as a single “bag” of words.

Skip-gram

Predicting the context from the target
Logic: The model takes a single “target word” as input and tries to predict the multiple “context words” that are likely to appear around it.
Performance:
- Slower to train because it performs more independent predictions.
- Superior at capturing the semantics of rare words and complex relationships.
- Generally preferred for high-quality, large-scale linguistic research.

Key Comparison

Feature	CBOW	Skip-gram
Input	Context words	Target word
Output	Target word	Context words
Data Efficiency	Better for frequent words	Better for rare words
Training Speed	Fast	Slower
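
The difference in prediction direction is easier to see on the training examples themselves. The base R sketch below builds them from one toy sentence; the helper `make_pairs()` and the sentence are assumptions for illustration, and the sketch deliberately leaves out the rest of the actual training procedure.

```r
## Build (target, context) rows from one tokenized sentence
## (hypothetical helper, defined here for illustration only)
make_pairs <- function(tokens, window = 2) {
  pair_list <- lapply(seq_along(tokens), function(i) {
    ctx_idx <- setdiff(seq(max(1, i - window), min(length(tokens), i + window)), i)
    data.frame(target  = rep(tokens[i], length(ctx_idx)),
               context = tokens[ctx_idx],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pair_list)
}

toy_sentence <- c("love", "makes", "people", "lose", "reason")
pairs <- make_pairs(toy_sentence, window = 1)

## Skip-gram view: each row is one prediction, target word -> one context word
print(pairs)

## CBOW view: the context words of a target are pooled into one input
## (grouping by word works here because no word repeats in the toy sentence)
cbow <- aggregate(context ~ target, data = pairs,
                  FUN = function(x) paste(x, collapse = " + "))
print(cbow)
```

Skip-gram turns this one sentence into eight separate predictions, while CBOW makes only five (one per position), which is one way to see why Skip-gram is slower but gives rare words more training signal.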

c. Extracting and Using Embeddings

Once the model is trained, you can convert the word embeddings into a matrix for analysis or perform nearest-neighbor searches.

Convert to Matrix

This is useful if you want to use the vectors for downstream tasks like clustering or visualization.

## Extract the model into a numeric matrix (Rows = Words, Cols = Dimensions)
embedding_matrix <- as.matrix(model)

## Check dimensions: Returns (Vocabulary Size) x (Vector Size)
dim(embedding_matrix)

[1] 6715  128

## Preview first 5 words and 10 dimensions of raw vector data
embedding_matrix[1:5, 1:10]

                   [,1]       [,2]       [,3]      [,4]       [,5]       [,6]
啦啦啦啦啦啦 -0.5565755  0.5403181 -0.1907424 2.2037737 -0.5880290  0.8335724
不行          2.0012131  0.8623514  1.0751743 0.8538892  0.2703077 -1.6228430
灌醉         -0.3714184  0.2764275 -0.8629256 0.1533670 -0.4302930 -0.8786584
遭           -1.0538697 -0.1597221  1.0251306 0.9514959 -0.1472969 -0.4060965
孩子氣       -0.9931604  2.5363295  0.4723191 0.7263509 -0.4221019  1.7803526
                   [,7]       [,8]      [,9]       [,10]
啦啦啦啦啦啦 -1.1787330 -0.9467657 1.0399548  1.16576743
不行         -0.3668516 -1.6583116 1.1864265 -0.66687912
灌醉          1.3465923 -1.2710948 1.8166435 -0.01753941
遭            0.2323204 -1.3714545 0.0287346  0.71901548
孩子氣        0.2000266  0.8425407 1.5854251  0.70922559

d. Finding Semantic Neighbors

To find words that are semantically close to a target word (the most common use case in linguistics):

## Identify the 10 words with the highest cosine similarity to "愛" (Love)
neighbors <- predict(model, newdata = "愛", type = "nearest", top_n = 10)

## Output the result list showing neighbor terms and their similarity scores
print(neighbors)

$愛
   term1    term2 similarity rank
1     愛 簡簡單單  0.7016374    1
2     愛       我  0.6994971    2
3     愛     判處  0.6983948    3
4     愛     明白  0.6833066    4
5     愛       你  0.6789176    5
6     愛 不明不白  0.6756501    6
7     愛     真愛  0.6751033    7
8     愛     甩甩  0.6744492    8
9     愛       說  0.6733435    9
10    愛     麥克  0.6722113   10

e. Saving and Loading the Model

Since training on large corpora can be time-consuming, you should save your model to a binary file. We already did this above with write.word2vec(), so the model can be reloaded without retraining.

## Load it back later
model <- read.word2vec("embeddings/songs_tokens_embeddings.bin")

f. Visualization

Principal Component Analysis (PCA):
- It acts as a global linear lens that rotates and projects high-dimensional data to capture the directions of greatest variation. Because it is a deterministic mathematical transformation, it produces the same result every time and excels at preserving the broad, overall structure of the dataset.
- It is particularly useful for identifying clear semantic directions, such as parallel vectors in gender or tense analogies, where the goal is to maintain the relative distances across the entire “map” of word embeddings.
t-Distributed Stochastic Neighbor Embedding (t-SNE):
- It is a non-linear, probabilistic algorithm designed to prioritize local neighborhoods. It calculates the likelihood that points are neighbors in high-dimensional space and attempts to replicate those specific relationships in a 2D layout.
- While it is stochastic and yields slightly different results depending on the random seed, it is the superior choice for revealing complex, hidden clusters.
- This method is ideal for visualizing how words naturally group into semantic islands that linear methods like PCA might otherwise obscure.

t-SNE Demo

## Define a set of focal words for the visualization
target_words <- c('寂寞', '快樂', '難過', '孤單', '高興', '驚訝', '男人', '女人')

## Define a safe extraction function to retrieve neighbors while handling missing vocabulary
get_neighbors <- function(model, target_word, n = 10) {
  ## Convert the model to a matrix for vocabulary checking and vector extraction
  wvs <- as.matrix(model)
  
  ## Verify if the target word exists in the model to avoid execution errors
  if (!(target_word %in% rownames(wvs))) {
    warning(sprintf("Word '%s' not in vocabulary.", target_word))
    return(character(0))
  }
  
  ## Extract the specific embedding vector for the target word
  target_vector <- wvs[target_word, , drop = FALSE]
  
  ## Identify the nearest neighbors based on the extracted vector
  res <- predict(model, newdata = target_vector, type = "nearest", top_n = n)
  
  ## Return the character vector of neighboring terms
  return(res[[target_word]]$term)
}

## Build a comprehensive list of words including targets and their top 5 neighbors
all_similar_words <- target_words %>%
  map(~get_neighbors(model, .x, n = 5)) %>%
  unlist() %>%
  union(target_words) %>%
  unique()

## Filter the full embedding matrix to keep only the selected subset of words
wvs <- as.matrix(model)
wvs_subset <- wvs[rownames(wvs) %in% all_similar_words, ]

## Execute t-SNE to compress the 128D vectors into 2D coordinates for plotting
## Perplexity is set low due to the small size of the word subset
set.seed(0)
tsne_out <- Rtsne(wvs_subset, perplexity = 3, max_iter = 10000, check_duplicates = FALSE)

## Prepare a data frame for ggplot2, mapping t-SNE coordinates to X and Y axes
tsne_df <- data.frame(
  X = tsne_out$Y[, 1],
  Y = tsne_out$Y[, 2],
  label = rownames(wvs_subset),
  ## Logical flag to distinguish between original targets and their neighbors
  is_target = rownames(wvs_subset) %in% target_words
)

## Render the 2D map using colors to highlight target concepts versus neighbors
ggplot(tsne_df, aes(X, Y, label = label)) +
  geom_point(aes(color = is_target)) +
  ## Use repel to automatically position labels and prevent overlapping text
  geom_text_repel(aes(color = is_target)) +
  scale_color_manual(values = c("royalblue", "firebrick"), 
                     labels = c("Neighbor", "Target Word"),
                     name = "Word Type") +
  theme_minimal() +
  labs(title = "t-SNE Visualization of Word Embeddings")

Guidelines for Determining Perplexity

There is no single “correct” mathematical formula for perplexity, but you can use several heuristics and diagnostics to find the optimal setting.

1. The Sample Size Rule of Thumb

The standard convention in machine learning is to set perplexity between 5 and 50. More importantly, perplexity should scale with the size of your dataset (\(N\)):

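The Bound: Perplexity must always be less than the total number of data points (\(N\)). The Rtsne documentation states a stricter condition, \(3 \times \text{perplexity} < N - 1\), which matters for small word subsets like ours.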
The Ratio: A common safe zone is to set perplexity to roughly 1% to 10% of your dataset size, depending on how dense your clusters are.

2. Match the Theoretical Cluster Size

If you expect small, tight groupings (e.g., grouping by artist gender, or pairing an artist’s 男人 and 女人 vectors together), your perplexity should match the expected size of those sub-clusters (e.g., 2 to 5).

How to Evaluate the Sweep Results

When you experiment with different perplexity values, evaluate the plots based on these structural behaviors:

If you see…	Interpretation	Action
Points scattered evenly like stars with no clear groupings.	Perplexity is too high for your sample size.	Decrease the perplexity.
Every single data point forms its own tiny, isolated cluster of 1 or 2 points.	Perplexity is too low; the model is over-interpreting random noise as local clusters.	Increase the perplexity.
Stable, distinct shapes that persist across 2 or 3 consecutive perplexity steps.	The Sweet Spot. These structures are robust and reflect true properties of the high-dimensional space.	Use this value for your final analysis.
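
A perplexity sweep along these lines might be set up as follows. This is a sketch under stated assumptions: `valid_perplexities()` is a hypothetical helper, the candidate values are arbitrary, the bound it applies is the one given in the Rtsne documentation (3 * perplexity < number of rows - 1), and random numbers stand in for real word vectors.

```r
## Hypothetical helper: keep only the perplexity values that satisfy
## 3 * perplexity < n_points - 1 (assumed bound, per the Rtsne documentation)
valid_perplexities <- function(n_points, candidates = c(2, 3, 5, 10, 30, 50)) {
  candidates[3 * candidates < n_points - 1]
}

n_words <- 40                       ## e.g. a subset of about 40 word vectors
sweep_values <- valid_perplexities(n_words)
print(sweep_values)

## Run the sweep only if Rtsne is installed; random data replaces embeddings
if (requireNamespace("Rtsne", quietly = TRUE)) {
  set.seed(0)
  toy_vectors <- matrix(rnorm(n_words * 16), nrow = n_words)
  layouts <- lapply(sweep_values, function(p) {
    Rtsne::Rtsne(toy_vectors, perplexity = p, check_duplicates = FALSE)$Y
  })
  names(layouts) <- paste0("perplexity_", sweep_values)
  ## Each element is an n_words x 2 coordinate matrix, one per perplexity value
  str(layouts, max.level = 1)
}
```

Plotting the resulting layouts side by side is what makes it possible to check whether a grouping persists across consecutive perplexity values.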

PCA Demo

## Perform PCA to reduce high-dimensional word vectors into two linear components
## center = TRUE shifts the variables to be zero-centered for accurate rotation
pca_out <- prcomp(wvs_subset, center = TRUE, scale. = FALSE)

## Calculate the percentage of variance captured by each principal component
## This helps students understand how much information is retained in the 2D plot
var_explained <- round(100 * pca_out$sdev^2 / sum(pca_out$sdev^2), 1)

## Prepare the plotting data frame using the first two principal components (PC1 and PC2)
pca_df <- data.frame(
  PC1 = pca_out$x[, 1],
  PC2 = pca_out$x[, 2],
  label = rownames(wvs_subset),
  ## Tag target words vs neighbors for visual differentiation
  is_target = rownames(wvs_subset) %in% target_words
)

## Render the PCA map with informative axis labels showing variance explained
ggplot(pca_df, aes(PC1, PC2, label = label)) +
  geom_point(aes(color = is_target), size = 2) +
  ## box.padding ensures enough clear space around labels for readability
  geom_text_repel(aes(color = is_target), 
                  box.padding = 0.5) +
  scale_color_manual(values = c("royalblue", "firebrick"), 
                     labels = c("Neighbor", "Target Word"),
                     name = "Word Type") +
  theme_minimal() +
  labs(
    title = "PCA Visualization of Word Embeddings",
    subtitle = "Dimensionality reduction using Principal Component Analysis",
    x = paste0("PC1 (", var_explained[1], "%)"),
    y = paste0("PC2 (", var_explained[2], "%)")
  ) +
  theme(legend.position = "bottom")

g. Key `word2vec` Hyperparameters to Consider

Parameter	Recommended Value	Description
`dim`	50 – 300	Higher dimensions capture more nuance but require more data.
`window`	5 – 10	Smaller windows capture functional similarity (synonyms); larger windows capture topical similarity.
`type`	“skip-gram”	Skip-gram usually performs better on smaller datasets or rare words.
`iter`	10 – 50	More iterations can lead to better convergence but take longer.

13.7 Intuitions for Word Embeddings (Self-Study)

Word embeddings transform words into dense vectors of real numbers, typically across hundreds of dimensions. These numerical mappings are established based on the Distributional Hypothesis, which suggests that words appearing in similar linguistic contexts share similar meanings. By representing linguistic units mathematically, we can move beyond simple string matching to analyze the underlying conceptual space of a corpus.

Semantic Calculation and Lexical Distance

The utility of embeddings stems from the ability to quantify lexical distance. In a high-dimensional vector space, the proximity between two word vectors serves as a proxy for their semantic relationship.

Cosine Similarity: This is the standard metric for measuring semantic proximity. It calculates the cosine of the angle between two vectors, where a value approaching 1 indicates high similarity and a narrow angle in the vector space.
Vector Arithmetic: Because these representations are algebraic, they can capture complex relational analogies. This is famously illustrated by the equation:

\[\text{Vector}(\text{"King"}) - \text{Vector}(\text{"Man"}) + \text{Vector}(\text{"Woman"}) \approx \text{Vector}(\text{"Queen"})\]
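
The arithmetic can be traced by hand on a toy example. The 2-dimensional vectors below are invented so that one axis loosely stands for royalty and the other for gender; they are assumptions for illustration and do not come from any trained model.

```r
## Toy 2-dimensional vectors (hypothetical): axis 1 ~ royalty, axis 2 ~ gender
emb <- rbind(
  king  = c(0.9,  0.8),
  queen = c(0.9, -0.8),
  man   = c(0.1,  0.8),
  woman = c(0.1, -0.8),
  apple = c(-0.7, 0.1)
)

## Cosine similarity between two numeric vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

## Vector("King") - Vector("Man") + Vector("Woman")
query <- emb["king", ] - emb["man", ] + emb["woman", ]

## Rank the remaining words by cosine similarity to the query vector
candidates <- setdiff(rownames(emb), c("king", "man", "woman"))
sims <- vapply(candidates, function(w) cosine(query, emb[w, ]), numeric(1))
print(sort(sims, decreasing = TRUE))
```

The three input words are excluded from the ranking, a common convention in analogy tests, because the query vector usually stays close to its own ingredients. In a trained model the match is approximate, which is why the equation uses \(\approx\) rather than equality.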

A Quick Example based on Sinica Corpus

The following R code utilizes a pre-trained embeddings model, sinica_model, to demonstrate how embeddings organize relational concepts. By visualizing these vectors, we can observe how the model mathematically encodes social and familial structures.

a. Dimensionality Reduction (t-SNE)

To make 300-dimensional data interpretable for human vision, we apply t-SNE. This algorithm compresses high-dimensional data into a 2D plane while prioritizing the preservation of local neighborhoods, ensuring that closely related words remain clustered together.

b. Capturing Parallelism

In the visualization, dashed lines connect pairs such as 叔叔 (Uncle) — 阿姨 (Aunt) and 爸爸 (Dad) — 媽媽 (Mom). When these lines appear roughly parallel and of similar length, it provides visual evidence that the model has identified a consistent “gender” dimension. The mathematical “shift” required to move from a male term to its female counterpart is nearly identical across different kinship pairs.
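
Parallelism can also be checked numerically instead of by eye, by comparing the offset vectors of the pairs. In the sketch below the vectors are constructed (a shared shift plus small random noise), so the parallel result is built in by assumption; the point is only to show the measurement one could apply to real kinship vectors.

```r
## Toy vectors (hypothetical): each female term = male term + shared shift + noise
gender_shift <- c(0.0, -1.0, 0.2)
male <- rbind(
  dad     = c(0.8, 0.5, 0.1),
  uncle   = c(0.6, 0.4, 0.7),
  grandpa = c(0.9, 0.6, 0.3)
)
set.seed(1)
noise  <- matrix(rnorm(9, sd = 0.05), nrow = 3)
female <- sweep(male, 2, gender_shift, "+") + noise
rownames(female) <- c("mom", "aunt", "grandma")

## Offset vectors: female term minus its male counterpart
offsets <- female - male
rownames(offsets) <- paste(rownames(male), rownames(female), sep = "->")

## Pairwise cosine similarity between offsets; values near 1 mean "parallel"
unit <- offsets / sqrt(rowSums(offsets^2))
parallel_sim <- unit %*% t(unit)
print(round(parallel_sim, 3))
```

Computing the offsets in the original embedding space avoids relying on the 2D plot alone, since t-SNE does not guarantee that directions and lengths are preserved.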

c. Clustering by Semantic Domain

Words within the same semantic field naturally gravitate toward the same region of the vector space. For instance, kinship terms like 爺爺 (Grandpa) and 奶奶 (Grandma) will be significantly closer to each other than to unrelated entities like 公司 (Company), reflecting their frequent co-occurrence in similar textual environments.

We apply this embedding methodology to identify words semantically similar to 男人 and 女人 within the lyrics database. This allows for a deeper downstream analysis of the specific sentiment or emotional traits associated with these gendered concepts across different artists.

13.8 Embeddings Training

13.8.1 Train a base word embedding model on the entire dataset

Previously, we trained a global embedding model (songs_tokens_embeddings.bin) based on the entire song lyrics corpus for illustration.

The next step is to train artist-specific embedding models. By generating distinct models for each individual artist, we can identify and compare the specific sets of lexical items that are conceptually closest to the target concepts, 男人 and 女人, within that particular artist’s body of work.

## Preview the first few rows of the song dataset to verify structure
head(song_sub)

## Load the pre-trained Word2Vec model into the R environment
base_embeddings <- read.word2vec("embeddings/songs_tokens_embeddings.bin")

13.8.2 Artist-Specific Model Training

We split the entire database into 17 sub-corpora, one per artist, and train an independent “Artist Word Embedding” model on each sub-corpus.

## Identify all unique artist names in the dataset to drive the loop
all_artists <- unique(song_sub$artist)

## Iteratively train a separate semantic model for every artist found
for (artist in all_artists) {
  
  ## Subset the tokenized lyrics belonging specifically to the current artist
  sub_lyric <- song_sub_tokens[song_sub$artist == artist]
  
  ## Print a progress message to the console showing current artist and song count
  message(sprintf("Working on %s with %d songs", artist, length(sub_lyric)))
  
  ## Train an artist-specific embedding model
  ## This captures the unique word associations and poetic style of the individual
  model_art <- word2vec(
    x = sub_lyric,      ## Input: Current artist's token list
    type = "skip-gram", ## Architecture: Better for small, specific datasets
    dim = 128,          ## Vector size: Kept consistent with the base model
    window = 5,         ## Context window size
    min_count = 5,      ## Minimum frequency threshold for a word to be included
    iter = 20,          ## Training epochs
    threads = 4         ## Parallel processing cores
  )
  
  ## Save the resulting model as a binary file named after the artist
  write.word2vec(model_art, sprintf("embeddings/%s.bin", artist))
}

13.8.3 Identify nearest neighbors

From each artist-specific word embedding model, we retrieve the 100 words most similar to “Man” and the 100 words most similar to “Woman” for subsequent sentiment analysis.
Each artist therefore contributes up to 200 semantically similar words, for a maximum of \(17 \times 200 = 3400\) words in total.

Out-of-Vocabulary Issue

When training individual embedding models for each artist to analyze concepts like 男人 (Man) and 女人 (Woman), the primary challenge is the Out-of-Vocabulary (OOV) problem. This occurs when a model has no vector for a word because the word was absent from that artist's training data or occurred fewer times than the min_count threshold.
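
Before training, it can help to check which artists are likely to run into this problem. The base R sketch below counts the search terms per artist and compares the counts with the frequency threshold; the toy token lists, the artist labels, and the helper `term_counts()` are all assumptions for illustration.

```r
## Toy tokenized songs for two hypothetical artists (not from the lyrics data)
toy_tokens <- list(
  c("man", "cry", "man", "night", "man"),
  c("man", "rain", "man", "road"),
  c("woman", "smile", "rain"),
  c("woman", "night", "road")
)
toy_artist <- c("artist_A", "artist_A", "artist_B", "artist_B")

## Count how often each search term occurs in each artist's sub-corpus
## (hypothetical helper, defined here for illustration only)
term_counts <- function(tokens, artist, terms) {
  by_artist <- split(tokens, artist)
  out <- matrix(0L, nrow = length(by_artist), ncol = length(terms),
                dimnames = list(names(by_artist), terms))
  for (a in names(by_artist)) {
    all_tok <- unlist(by_artist[[a]])
    out[a, ] <- vapply(terms, function(w) sum(all_tok == w), integer(1))
  }
  out
}

counts <- term_counts(toy_tokens, toy_artist, c("man", "woman"))
print(counts)

## A term will be OOV for an artist when its count falls below min_count
min_count <- 5
is_oov <- counts < min_count
print(is_oov)
```

Applied to the real data, the same table would show in advance which artist models need a fallback strategy for 男人 or 女人.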

## Define global parameters for the analysis
search_terms <- c("男人", "女人")
top_n_neighbors <- 100
model_dir <- "embeddings"
output_path <- "data/artist_semantic_neighbors.csv"

## Load the global Base Model to provide a semantic reference for all artists
base_model_path <- file.path(model_dir, "base_embeddings.bin")
if (!file.exists(base_model_path)) {
  stop("Base model not found!")
}
base_model <- read.word2vec(base_model_path)
## Extract the embedding matrix for fast vocabulary lookups and vector projection
base_wvs <- as.matrix(base_model)

## Identify all artist-specific binary model files in the directory
## Excludes the base embedding file by comparing file names exactly
model_files <- list.files(model_dir, pattern = "\\.bin$", full.names = TRUE)
model_files <- model_files[basename(model_files) != basename(base_model_path)]

## Function to extract semantic neighbors while handling Out-of-Vocabulary (OOV) terms
extract_artist_data <- function(model_path, terms, local_base_matrix) {
  artist_name <- str_replace(basename(model_path), "\\.bin$", "")
  message(paste("Processing artist:", artist_name))
  
  ## Load the individual artist's trained embedding space
  model_art <- read.word2vec(model_path)
  wvs_art <- as.matrix(model_art)
  
  ## Iterate through the target concepts to extract their local semantic context
  map_df(terms, function(term) {
    ## Check if the artist has used the term enough times to have a native vector
    if (term %in% rownames(wvs_art)) {
      target_vector <- wvs_art[term, , drop = FALSE]
      mode_type <- "direct"
    } else {
      ## FALLBACK: If the word is missing, query with its vector from the global Base Model
      ## CAUTION: separately trained models do not share an aligned vector space,
      ## so neighbors found this way are exploratory; check `extraction_mode` before interpreting
      message(sprintf("  -> '%s' OOV for %s. Using Base Model projection.", term, artist_name))
      
      if (term %in% rownames(local_base_matrix)) {
        target_vector <- local_base_matrix[term, , drop = FALSE]
        mode_type <- "base_projection"
      } else {
        return(NULL)
      }
    }
    
    ## Extract top neighbors within the artist's specific semantic landscape
    res <- predict(model_art, newdata = target_vector, type = "nearest", top_n = top_n_neighbors)
    
    if (is.null(res[[1]])) return(NULL)
    
    ## Return results in a tidy format for comparative analysis
    data.frame(
      artist = artist_name,
      target_concept = term,
      extraction_mode = mode_type,
      neighbor_word = res[[1]]$term,
      similarity = res[[1]]$similarity,
      stringsAsFactors = FALSE
    )
  })
}

## Execute the extraction across all found artist models
all_simwords <- map_df(model_files, ~extract_artist_data(.x, search_terms, base_wvs))

## Export the consolidated data to a CSV for visualization and sentiment analysis
write.csv(all_simwords, output_path, row.names = FALSE)
message("Done! Results saved to ", output_path)

## Display the final data frame
all_simwords

dim(all_simwords)

[1] 3128    5

If we analyze 17 artists with 100 neighbors for each of the two target concepts (男人 and 女人), the resulting data frame should in theory contain 3,400 rows. However, our all_simwords data frame contains only 3,128 rows. Take a moment to inspect the data on your own to see if you can determine why these rows are missing.
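
One way to start that inspection is to count rows per artist and target concept and list every group that falls short of the requested number of neighbors. The sketch below does this on a small invented stand-in for all_simwords; `toy_simwords` and `expected_n` are assumptions for illustration, and in the real data the threshold would be 100.

```r
## Toy stand-in for all_simwords (invented; the real object has more columns)
toy_simwords <- data.frame(
  artist = c(rep("artist_a", 5), rep("artist_b", 3)),
  target_concept = c(rep("男人", 3), rep("女人", 2), rep("男人", 3)),
  neighbor_word = paste0("w", 1:8),
  stringsAsFactors = FALSE
)
expected_n <- 3  ## plays the role of top_n_neighbors (100 in the real analysis)

## Build the full artist x concept grid so entirely missing groups show up too
grid <- expand.grid(
  artist = unique(toy_simwords$artist),
  target_concept = unique(toy_simwords$target_concept),
  stringsAsFactors = FALSE
)

## Count rows per group, then attach the counts to the grid (missing groups get 0)
counts <- aggregate(neighbor_word ~ artist + target_concept,
                    data = toy_simwords, FUN = length)
names(counts)[3] <- "n_rows"
coverage <- merge(grid, counts, all.x = TRUE)
coverage$n_rows[is.na(coverage$n_rows)] <- 0

## Groups that returned fewer neighbors than requested
shortfalls <- coverage[coverage$n_rows < expected_n, ]
shortfalls
```

Applied to the real data frame, the same table shows which artist and concept combinations account for the gap, which narrows down where to look for the cause.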

13.9 Sentiment Analysis

Analyzing the Sentiment Distribution of Neighboring Words

What proportion of the words conceptually close to 男人 (men) carry a positive/negative connotation? Furthermore, how many of these neighboring words express happiness, sadness, or anger?
What proportion of the words conceptually close to 女人 (women) carry a positive/negative connotation? Furthermore, how many of these neighboring words express happiness, sadness, or anger?

Check the Dictionary! Sentiment Dictionary!

Sentiment analysis ranges from simple valence models that categorize feelings as positive or negative to more granular discrete models, such as Ekman’s, which identify specific emotional states like joy, anger, or sadness.
Currently, there are a few publicly available Chinese sentiment dictionaries. In this small study, we used the NRC Emotion Dictionary.
NRC is primarily an English sentiment dictionary, but the authors used automatic translation to expand it to over 100 languages worldwide.
We used the Traditional Chinese version provided by NRC (keep in mind that the automatic translations are not always accurate).
See: Saif Mohammad and Peter Turney (2013). Crowdsourcing a Word-Emotion Association Lexicon. Computational Intelligence, 29(3), 436-465.

The National Taiwan University Sentiment Dictionary (NTUSD) is an established, widely used linguistic resource developed for Chinese natural language processing and opinion mining. Curated by the Natural Language Processing Laboratory at NTU, it contains over 11,000 entries explicitly categorized into distinct lists of positive and negative words. Because it was specifically optimized for Traditional Chinese text structures and cultural idioms, it serves as an excellent baseline lexicon for researchers analyzing valence, stance, and sentiment trends in Taiwanese digital media, literature, and social discourse.

## Load the NRC Emotion Lexicon and standardize the column name for Chinese words
nrc_dict <- read_tsv("demo_data/NRC-Emotion-Lexicon/OneFilePerLanguage/Chinese-Traditional-NRC-EmoLex.txt") %>%
  rename(Chinese = `Chinese-Traditional Word`)

## Preview the raw dictionary structure
head(nrc_dict)

## Create an emotion-focused dataset (Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise, Trust)
nrc_emotion <- nrc_dict %>%
  ## Select specific emotion categories and the word column
  select(anger:joy, sadness:trust, Chinese) %>%
  ## Transform from wide to long format (each word-emotion pair becomes a row)
  pivot_longer(anger:trust, names_to = "emotion") %>%
  rename(word = Chinese) %>%
  ## Keep only entries where the word is actually associated with the emotion (score >= 1)
  filter(value >= 1)

head(nrc_emotion)

## Create a sentiment-focused dataset (Positive vs. Negative)
nrc_sentiment <- nrc_dict %>%
  ## Select the binary sentiment columns
  select(positive:negative, Chinese) %>%
  ## Transform to long format for easier joining with text data
  pivot_longer(positive:negative, names_to = "sentiment") %>%
  rename(word = Chinese) %>%
  ## Filter for active sentiment associations
  filter(value >= 1)


head(nrc_sentiment)

Distribution of NRC Sentiment Lexicon

nrc_sentiment %>%
  count(sentiment) %>%
  ggplot(aes(reorder(sentiment,-n), n, fill=n)) +
  geom_bar(stat="identity", color="white") +
  scale_fill_distiller(palette = "RdYlBu")+
  labs(x = "Sentiment Types in NRC",
       y = "Number of Words",
       title = "Distribution of Words Across Sentiment Types in NRC")

Distribution of NRC Emotion Lexicon

nrc_emotion %>%
  count(emotion) %>%
  ggplot(aes(reorder(emotion,-n), n, fill=n)) +
  geom_bar(stat="identity", color="white") +
  scale_fill_distiller(palette = "RdYlBu")+
  labs(x = "Emotion Types in NRC",
       y = "Number of Words",
       title = "Distributions of Words Across Different Emotions in NRC")

13.10 Mapping 男人/女人 Neighbors to NRC Sentiment Dictionary

We map each neighbor word to its corresponding sentiment label from the NRC dictionary.

## Preview the first few rows of the neighbor extraction results
head(all_simwords)

## Perform an inner join to match neighboring words with their sentiment labels
## many-to-many handles words that may carry multiple sentiment tags
sentiment_results <- all_simwords %>%
  inner_join(nrc_sentiment, by = c("neighbor_word" = "word"), relationship= "many-to-many") %>%
  distinct(artist, target_concept, neighbor_word, sentiment)

head(sentiment_results)
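
To see why the join can multiply rows and why the de-duplication step matters, consider the following toy sketch. It uses base R `merge()` and `unique()` as stand-ins for `inner_join()` and `distinct()`; the words and labels are invented, and the duplicated lexicon entry is an assumption about how repeated rows can arise (for example, when several English entries share one Chinese translation).

```r
## Toy neighbor list: the same word can be a neighbor for several artists
toy_neighbors <- data.frame(
  artist = c("artist_a", "artist_a", "artist_b"),
  neighbor_word = c("愛", "淚", "愛"),
  stringsAsFactors = FALSE
)

## Toy lexicon: a word may appear on more than one row
toy_lexicon <- data.frame(
  word = c("愛", "愛", "淚"),
  sentiment = c("positive", "positive", "negative"),
  stringsAsFactors = FALSE
)

## Inner-join behavior: neighbors without a lexicon entry would be dropped,
## and every matching lexicon row produces its own output row
joined <- merge(toy_neighbors, toy_lexicon,
                by.x = "neighbor_word", by.y = "word")
n_joined <- nrow(joined)

## Collapse repeated (artist, word, sentiment) combinations
deduped <- unique(joined[, c("artist", "neighbor_word", "sentiment")])
n_deduped <- nrow(deduped)

c(joined = n_joined, deduped = n_deduped)
```

The duplicated lexicon row inflates the joined table, and de-duplicating brings it back to one row per artist, word, and sentiment label, so repeated lexicon entries do not inflate the counts used later.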

We then calculate a simple sentiment density score: for each artist and each target word (男人 vs. 女人), the proportion of positive/negative words among the neighbors that matched the NRC lexicon.

sentiment_summary <- sentiment_results %>%
  # 1. Group by both the artist and the specific target word
  group_by(artist, target_concept) %>%
  
  # 2. Calculate density scores based on categorical labels
  summarise(
    neighbor_type = n_distinct(neighbor_word),
    pos_count = sum(sentiment == "positive", na.rm = TRUE),
    neg_count = sum(sentiment == "negative", na.rm = TRUE),
    
    # Density score is the proportion of labeled words relative to the distinct
    # neighbors that matched the lexicon (sentiment_results only holds matched words)
    pos_density = pos_count / neighbor_type,
    neg_density = neg_count / neighbor_type,
    .groups = "drop"
  )


head(sentiment_summary)
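
A small worked example may help clarify what the density score measures. Because the summary is computed on sentiment_results, the denominator counts only the distinct neighbors that matched the lexicon, and a word tagged as both positive and negative contributes to both numerators. The toy values below are invented for a single artist and a single target concept.

```r
## Toy sentiment_results for one artist and one target concept (invented values)
toy_results <- data.frame(
  neighbor_word = c("愛", "淚", "夢", "夢"),
  sentiment = c("positive", "negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

## Denominator: distinct neighbors that matched the lexicon at all
n_matched <- length(unique(toy_results$neighbor_word))

## Numerators: rows carrying each label
pos_density <- sum(toy_results$sentiment == "positive") / n_matched
neg_density <- sum(toy_results$sentiment == "negative") / n_matched

c(positive = pos_density, negative = neg_density)
```

Both densities equal 2/3 in this example, so they sum to more than 1. The two scores are therefore not complementary shares of a single total and should be read separately.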

sentiment_summary_long <- sentiment_summary %>%
  select(artist, target_concept, neg_density, pos_density) %>%
  pivot_longer(cols = c(pos_density, neg_density),
               names_to = "sentiment",
               values_to = "density") %>%
  mutate(sentiment = fct_recode(sentiment, "positive" = "pos_density", "negative" = "neg_density"))

Visualization of Results: Sentiment

## Render a barplot
ggplot(sentiment_summary_long, aes(x = sentiment, y = density, fill = sentiment)) +
  # Create the bars
  geom_bar(stat = "identity") +
  # Facet into a grid: one row per target concept, one column per artist
  facet_grid(target_concept ~ artist) +
  scale_fill_manual(values = c("positive" = "#4E79A7", "negative" = "#E15759")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
        strip.text = element_text(size = rel(0.5), face = "bold")) + 
  labs(
    title = "Sentiment Distribution of Word Embedding Neighbors",
    subtitle = "Comparing 'Man' vs. 'Woman' across different artists",
    x = "Sentiment Category",
    y = "Density (proportion of matched neighbors)",
    fill = "Sentiment"
  )

## Render a heatmap to identify broad emotional patterns and artist clusters
ggplot(sentiment_summary_long, aes(x = sentiment, y = artist, fill = density)) +
  ## Use white borders between tiles to improve visual clarity
  geom_tile(color = "white", linewidth = 0.2) +
  ## Separate the visualization into "Man" and "Woman" panels
  facet_wrap(~target_concept) +
  ## Apply a Red-Purple gradient to highlight areas of high sentiment density
  scale_fill_distiller(palette = "RdPu", direction = 1, name = "Density") +
  labs(
    title = "Sentiment Density in Mandarin Pop Lyrics",
    subtitle = "Comparing Semantic Neighbors of 'Man' vs. 'Woman'",
    x = "NRC Sentiment Category",
    y = "Artist",
    caption = "Density = sentiment-labeled neighbors / lexicon-matched neighbors (proportion, 0-1)"
  ) +
  ## Increase spacing between the faceted panels for better legibility
  theme(
    panel.spacing = unit(2, "lines")
  )

Post-hoc Analysis: Exploring Sentiment-relevant 男人 Neighbors

## MALE
sentiment_results %>%
  count(sentiment, neighbor_word, target_concept, sort = TRUE) %>%
  filter(target_concept == "男人") %>%
  select(-target_concept) %>%
  # Reshape data from long to wide format
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # comparison.cloud expects a term matrix with words as row names;
  # colors are assigned to the columns in their order in the matrix
  column_to_rownames("neighbor_word") %>%
  as.matrix() %>%
  comparison.cloud(colors = c("red", "steelblue"),
                   max.words = 150,
                   scale = c(2.5, 0.8),
                   random.order = FALSE,
                   title.size = 4,
                   title.bg.colors = c("red", "steelblue"))

Post-hoc Analysis: Exploring Sentiment-relevant 女人 Neighbors

## FEMALE
sentiment_results %>%
  count(sentiment, neighbor_word, target_concept, sort = TRUE) %>%
  filter(target_concept == "女人") %>%
  select(-target_concept) %>%
  # Reshape data from long to wide format
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # comparison.cloud expects a term matrix with words as row names;
  # colors are assigned to the columns in their order in the matrix
  column_to_rownames("neighbor_word") %>%
  as.matrix() %>%
  comparison.cloud(colors = c("red", "steelblue"),
                   max.words = 150,
                   scale = c(2.4, 0.6),
                   random.order = FALSE,
                   title.size = 4,
                   title.bg.colors = c("red", "steelblue"))

Simple Statistical Analysis (Positive/Negative Binary Sentiment)

# --- Step 1: Fit Two-Way ANOVA with Interaction ---
# Evaluates main effects (target_concept, sentiment) and their interaction (*)
anova_model <- aov(density ~ target_concept * sentiment, data = sentiment_summary_long)

# --- Step 2: Display Significance Table ---
summary(anova_model)

                         Df Sum Sq Mean Sq F value   Pr(>F)    
target_concept            1 0.0022  0.0022   0.257    0.614    
sentiment                 1 0.6614  0.6614  78.158 1.83e-12 ***
target_concept:sentiment  1 0.0169  0.0169   1.994    0.163    
Residuals                60 0.5077  0.0085                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
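
# The sums of squares in the table above can also be read as effect sizes.
# The sketch below computes eta squared (one common choice, not part of the
# original analysis) as each term's share of the total sum of squares.
# The numbers are copied from the printed table, so they carry its rounding.

```r
## Sums of squares copied from the ANOVA table above
ss <- c(
  target_concept = 0.0022,
  sentiment      = 0.6614,
  interaction    = 0.0169,
  residuals      = 0.5077
)

## Eta squared: share of total variation attributed to each term
eta_sq <- ss[c("target_concept", "sentiment", "interaction")] / sum(ss)
round(eta_sq, 3)
```

# On these figures, sentiment accounts for a bit more than half of the total
# variation in density, while target concept and the interaction each account
# for under 2 percent, consistent with the significance pattern in the table.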

# --- Step 3: Diagnostic Visualization ---
# Generates notched boxplots to compare density distributions across sentiment types
# and across target concepts
boxplot(density ~ sentiment, notch = TRUE, data = sentiment_summary_long)

boxplot(density ~ target_concept, notch = TRUE, data = sentiment_summary_long)

# --- Generate Interaction Plot Directly from Raw Long Data ---
ggplot(sentiment_summary_long, aes(x = target_concept, y = density, color = sentiment, group = sentiment)) +
  # 1. Calculate and plot the mean points
  stat_summary(fun = mean, geom = "point", size = 3.5) +
  
  # 2. Calculate and draw the connecting interaction lines between those means
  stat_summary(fun = mean, geom = "line", linewidth = 1.2) +
  
  # 3. Optional but recommended: Add 95% Confidence Interval error bars
  # (mean_cl_normal requires the Hmisc package to be installed)
  # This shows the statistical variation among the 17 artists for each point
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.1, alpha = 0.6) +
  
  # Formatting and Styling
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = c("positive" = "#238b45", "negative" = "#cb181d")) +
  theme_minimal(base_family = "Arial") +
  labs(
    title = "Sentiment x Target Concept Interaction Plot",
    subtitle = "Density means with 95% CI error bars",
    x = "Target Concept",
    y = "Sentiment Density",
    color = "Sentiment Profile"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 13),
    legend.position = "top"
  )

With only 17 artists, our current sample size is relatively small. There are several clear paths to refine and expand this analysis, such as increasing the artist sample, using statistical models that account for each artist contributing several observations, or incorporating additional textual metadata. I will leave these avenues open for you to explore further depending on your research interests.
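
As one possible starting point for a more advanced framework, the sketch below fits a repeated-measures ANOVA, which treats the four densities from the same artist as related rather than independent. This is not the analysis used in this chapter: the data are simulated with random values, `toy_long` is a hypothetical stand-in for sentiment_summary_long, and the English factor labels are placeholders.

```r
## Simulated stand-in for sentiment_summary_long (random values, illustration only)
set.seed(1)
toy_long <- expand.grid(
  artist = paste0("artist_", 1:6),
  target_concept = c("man", "woman"),
  sentiment = c("positive", "negative"),
  stringsAsFactors = TRUE
)
toy_long$density <- runif(nrow(toy_long), min = 0.2, max = 0.8)

## Repeated-measures ANOVA: both factors vary within artist, so the Error()
## term gives each within-artist effect its own error stratum
rm_model <- aov(
  density ~ target_concept * sentiment +
    Error(artist / (target_concept * sentiment)),
  data = toy_long
)
summary(rm_model)
```

The formula differs from the two-way ANOVA above only in the `Error()` term. With the real data, every artist would need a value in each of the four cells for this balanced design to apply directly.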

13.11 Conclusion

Corpus Analysis and Computer Science

Research Questions & Hypothesis
Data Collection (Corpus, Web-Crawling, Other Text Digitizations)
Data Preprocessing (Text Pre-processing, Data Wrangling)
Data Analysis (Exploratory Data Analysis & Statistics)
Interpretation (Data Visualization, Reproducible Reports)

The Future of Linguistics?

Strengthen your interdisciplinary skills (Statistics, Programming, and Digital Literacy)
Cultivate a digital mindset
Step out of the humanities comfort zone

Don’t overthink it, just learn it!

Source: Learn Data Science for Free