Chapter 13 Text Analytics: A Start

In this chapter, I will present a quick overview of computational text analytics with R. The most important package for exploratory text analysis is quanteda. As computational text analytics is a broad topic in its own right, I would recommend more advanced courses for those who are interested in this field (e.g., ENC2036: Corpus Linguistics, ENC2045: Computational Linguistics).

This chapter only provides a very general overview of the common procedures and tasks in text processing.

13.1 Installing quanteda

There are many R packages designed for computational text analytics. You may consult the CRAN Task View: Natural Language Processing for many more alternatives.

To start with, this tutorial will use a powerful package, quanteda, for managing and analyzing textual data in R. You may refer to the official documentation of the package for more detail.

The library quanteda is not included in the default base R installation. Please install the package if you haven’t done so.

install.packages("quanteda")
install.packages("quanteda.textmodels")
install.packages("quanteda.textstats")
install.packages("quanteda.textplots")
install.packages("readtext")
# remotes::install_github("quanteda/quanteda.corpora")
# remotes::install_github("kbenoit/quanteda.dictionaries")

Also, as noted in the quanteda documentation, because this library compiles some C++ and Fortran source code, you need to install the appropriate compilers.

  • If you are using a Windows platform, this means you will also need to install the Rtools software available from CRAN.
  • If you are using macOS, you should install the macOS tools.

If you run into any installation errors, please go to the official documentation page for additional assistance.

The library quanteda contains all of the core natural language processing and textual data management functions. The following companion libraries work with quanteda, providing more advanced functions for computational text analytics.

  • quanteda.textmodels: includes the text models and supporting functions (i.e., textmodel_*() functions).
  • quanteda.textstats: includes statistics for textual data (i.e., textstat_*() functions).
  • quanteda.textplots: includes plotting functions for textual data (i.e., textplot_*() functions).
  • quanteda.corpora: additional textual data for use with quanteda (install using remotes::install_github("quanteda/quanteda.corpora")).
  • quanteda.dictionaries: various dictionaries for use with quanteda, including the function liwcalike(), an R implementation of the Linguistic Inquiry and Word Count approach to text analysis (install using remotes::install_github("kbenoit/quanteda.dictionaries")).

library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(tidyverse)
packageVersion("quanteda")
[1] '4.1.0'

13.2 Building a corpus from a character vector

To demonstrate a typical corpus-analytic workflow, I will be using a pre-loaded corpus that comes with the quanteda package, data_corpus_inaugural. This is a corpus of US presidential inaugural address texts, with metadata, from 1789 to the present.

data_corpus_inaugural
Corpus consisting of 59 documents and 4 docvars.
1789-Washington :
"Fellow-Citizens of the Senate and of the House of Representa..."

1793-Washington :
"Fellow citizens, I am again called upon by the voice of my c..."

1797-Adams :
"When it was first perceived, in early times, that no middle ..."

1801-Jefferson :
"Friends and Fellow Citizens: Called upon to undertake the du..."

1805-Jefferson :
"Proceeding, fellow citizens, to that qualification which the..."

1809-Madison :
"Unwilling to depart from examples of the most revered author..."

[ reached max_ndoc ... 53 more documents ]
class(data_corpus_inaugural)
[1] "corpus"    "character"
length(data_corpus_inaugural)
[1] 59
names(data_corpus_inaugural)
 [1] "1789-Washington" "1793-Washington" "1797-Adams"      "1801-Jefferson" 
 [5] "1805-Jefferson"  "1809-Madison"    "1813-Madison"    "1817-Monroe"    
 [9] "1821-Monroe"     "1825-Adams"      "1829-Jackson"    "1833-Jackson"   
[13] "1837-VanBuren"   "1841-Harrison"   "1845-Polk"       "1849-Taylor"    
[17] "1853-Pierce"     "1857-Buchanan"   "1861-Lincoln"    "1865-Lincoln"   
[21] "1869-Grant"      "1873-Grant"      "1877-Hayes"      "1881-Garfield"  
[25] "1885-Cleveland"  "1889-Harrison"   "1893-Cleveland"  "1897-McKinley"  
[29] "1901-McKinley"   "1905-Roosevelt"  "1909-Taft"       "1913-Wilson"    
[33] "1917-Wilson"     "1921-Harding"    "1925-Coolidge"   "1929-Hoover"    
[37] "1933-Roosevelt"  "1937-Roosevelt"  "1941-Roosevelt"  "1945-Roosevelt" 
[41] "1949-Truman"     "1953-Eisenhower" "1957-Eisenhower" "1961-Kennedy"   
[45] "1965-Johnson"    "1969-Nixon"      "1973-Nixon"      "1977-Carter"    
[49] "1981-Reagan"     "1985-Reagan"     "1989-Bush"       "1993-Clinton"   
[53] "1997-Clinton"    "2001-Bush"       "2005-Bush"       "2009-Obama"     
[57] "2013-Obama"      "2017-Trump"      "2021-Biden"     

We create a corpus() object from the pre-loaded corpus in quanteda, data_corpus_inaugural:

corp_us <- corpus(data_corpus_inaugural) # save the `corpus` to a short obj name
summary(corp_us)

After the corpus is loaded, we can use summary() to get the metadata of each text in the corpus, including the number of word types and tokens. This allows us to have a quick look at the size of the addresses made by each president.
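
For instance, here is a minimal sketch of how you might inspect this metadata directly; docvars() returns the document-level variables, and the n argument of summary() (assumed here) simply limits how many documents are reported:

## document-level variables (metadata) of the corpus
head(docvars(corp_us))
## the summary table is a data frame, so it can be stored and reused
corp_us_summary <- summary(corp_us, n = 5)
corp_us_summary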

quanteda implements a default tokenization method for English texts, and it can also tokenize Chinese texts to some extent (as we will see below). For more control over word segmentation, you may need to consult dedicated segmentation packages (e.g., jiebaR, ckiptagger). More details are usually discussed in ENC2045 or ENC2036.

corp_us %>%
  summary %>%
  ggplot(aes(x = Year, y = Tokens, group = 1)) +
    geom_line() +
    geom_point() +
    theme_bw()


Exercise 13.1 Could you reproduce the above line plot and add the President information to the plot as labels on the dots?

Hint: Please check ggplot2::geom_text() or the more advanced ggrepel::geom_text_repel().


13.3 Tokenization

The first step for most textual analyses is usually tokenization, i.e., breaking each long text into word tokens for linguistic analysis.

corp_us_tokens <- tokens(corp_us)
corp_us_tokens[1]
Tokens consisting of 1 document and 4 docvars.
1789-Washington :
 [1] "Fellow-Citizens" "of"              "the"             "Senate"         
 [5] "and"             "of"              "the"             "House"          
 [9] "of"              "Representatives" ":"               "Among"          
[ ... and 1,525 more ]

Starting from quanteda version 4.0+, a new class of tokens object, tokens_xptr, was introduced. This class improves the efficiency of modifying tokens objects built from large text corpora.
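
A minimal sketch of the conversion, assuming quanteda >= 4.0 and its as.tokens_xptr()/as.tokens() converters:

## convert a regular tokens object to an external-pointer tokens object
corp_us_xptr <- as.tokens_xptr(corp_us_tokens)
class(corp_us_xptr)
## convert back to a regular tokens object when needed
corp_us_tokens2 <- as.tokens(corp_us_xptr)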

13.4 Keyword-in-Context (KWIC)

Keyword-in-Context (KWIC), or concordancing, is one of the most frequently used methods in corpus analysis. The idea is very intuitive: we learn more about the semantics of a word (or any other linguistic unit) by examining how it is used in its wider contexts.

We can use kwic() to perform a search for a word and retrieve its concordances from the corpus:

kwic(corp_us_tokens, "terror", window = 5)

kwic() returns a data frame, which can be easily exported as a CSV file for later use.
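
For example, a minimal sketch of exporting a search result (the file name is just an illustration):

## run a search and save the concordance lines to a CSV file
kwic_terror <- kwic(corp_us_tokens, "terror", window = 5)
write.csv(as.data.frame(kwic_terror), "kwic_terror.csv", row.names = FALSE)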

Please note that kwic() takes a tokens object as the input argument.

That is, please tokenize the corpus object with tokens() first before you perform more advanced textual analysis, e.g., kwic().

Also, with kwic(), the pattern you look for matches single tokens by default; multi-word patterns need to be wrapped in phrase() (see Section 13.5).

For Chinese, quanteda can take care of Chinese word segmentation, but only with rather limited capacity.

texts <- c("舞台正中間擺著一張「空著的導演椅」,影片一下全場鼻酸。",
           "第58屆金馬獎頒獎典禮今(27)日在國父紀念館盛大登場,星光大道紅毯於下午5點30分登場。")
corpus_ch <- corpus(texts)
corpus_ch
Corpus consisting of 2 documents.
text1 :
"舞台正中間擺著一張「空著的導演椅」,影片一下全場鼻酸。"

text2 :
"第58屆金馬獎頒獎典禮今(27)日在國父紀念館盛大登場,星光大道紅毯於下午5點30分登場。"
corpus_ch_tokens <- tokens(corpus_ch)
corpus_ch_tokens[[1]]
 [1] "舞台" "正"   "中間" "擺著" "一張" "「"   "空"   "著"   "的"   "導演"
[11] "椅"   "」"   ","   "影片" "一下" "全"   "場"   "鼻酸" "。"  
corpus_ch_tokens[[2]]
 [1] "第"         "58"         "屆"         "金馬獎"     "頒獎典禮"  
 [6] "今"         "("         "27"         ")"         "日"        
[11] "在"         "國父紀念館" "盛大"       "登場"       ","        
[16] "星光"       "大道"       "紅"         "毯"         "於"        
[21] "下午"       "5"          "點"         "30"         "分"        
[26] "登場"       "。"        

13.5 KWIC with Regular Expressions

For more complex searches, we can use regular expressions as well in kwic(). For example, if you want to include terror and all its other related word forms, such as terrorist, terrorism, terrors, you can do a regular expression search.

kwic(corp_us_tokens, "terror.*", valuetype = "regex")

By default, kwic() matches patterns token by token. If you would like to look up a multiword combination, use phrase():

kwic(corp_us_tokens, phrase("our country"))
kwic(corp_us_tokens, phrase("[a-zA-Z]+ against"), valuetype="regex")

It should be noted that the output of kwic() includes not only the concordances (i.e., the preceding/following co-texts plus the keyword), but also the source document of each concordance line. This is extremely convenient when you need to refer back to the original discourse context of a concordance line.
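
For instance, using the kwic_terror object created above, the docname column records which address each concordance line comes from:

head(kwic_terror$docname)
head(kwic_terror[, c("pre", "keyword", "post")])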


Exercise 13.2 Please create a bar plot, showing the number of uses of the word country in each president’s address. Please include different variants of the word, e.g., countries, Countries, Country, in your kwic() search.

13.6 Lexical Dispersion Plot

Plotting a kwic object with textplot_xray() produces a lexical dispersion plot, which allows us to visualize the occurrences of particular terms throughout the texts.

corp_us_tokens %>%
  tokens_subset(Year > 1949) %>%
  kwic(pattern = "american") %>%
  textplot_xray()

## subset the corpus
corp_us_subset <- corp_us_tokens %>%
  tokens_subset(Year > 1949)

## Plotting
textplot_xray(kwic(corp_us_subset, pattern = "american"),
              kwic(corp_us_subset, pattern = "people")) +
  aes(color = keyword) + scale_color_manual(values = c('lightpink', 'royalblue'))

13.7 Collocation

Collocations are sequences of words that co-occur more frequently than we would expect by chance (e.g., United States). In quanteda, we can identify them with textstat_collocations(), optionally after removing stopwords or restricting the candidate tokens:

## collocation (raw)
textstat_collocations(corp_us_tokens, size = 2, min_count = 5)
## collocation (filtered)
corp_us_tokens %>%
    tokens_remove(stopwords("en")) %>%
    tokens_select(pattern = "^[A-Z]", valuetype = "regex", 
                  case_insensitive = FALSE, padding = TRUE) %>%
    textstat_collocations(size = 2, min_count = 5, tolower = T)
## Multiwords
corp_us_tokens %>%
    tokens_remove(stopwords("en")) %>%
    tokens_select(pattern = "^[A-Z]", valuetype = "regex", 
                  case_insensitive = FALSE, padding = TRUE) %>%
    textstat_collocations(size = 3, min_count = 2, tolower = T)

13.8 Document-Feature Matrix

Another important object class defined in quanteda is the dfm, which stands for document-feature matrix. It is a two-dimensional co-occurrence table, with the rows being the documents in the corpus and the columns being the features used to characterize the documents. Each cell in the matrix records a co-occurrence statistic between a document and a feature (by default, the raw frequency of the feature in that document).

Usually, we first tokenize the corpus with tokens() and then use dfm() to create the document-feature matrix from the tokens object. That is, it is recommended to use tokens, rather than a raw corpus, as the input to dfm().

corp_us_dfm <- corp_us_tokens %>% dfm
class(corp_us_dfm)
[1] "dfm"
attr(,"package")
[1] "quanteda"
corp_us_dfm
Document-feature matrix of: 59 documents, 9,437 features (91.84% sparse) and 4 docvars.
                 features
docs              fellow-citizens  of the senate and house representatives :
  1789-Washington               1  71 116      1  48     2               2 1
  1793-Washington               0  11  13      0   2     0               0 1
  1797-Adams                    3 140 163      1 130     0               2 0
  1801-Jefferson                2 104 130      0  81     0               0 1
  1805-Jefferson                0 101 143      0  93     0               0 0
  1809-Madison                  1  69 104      0  43     0               0 0
                 features
docs              among vicissitudes
  1789-Washington     1            1
  1793-Washington     0            0
  1797-Adams          4            0
  1801-Jefferson      1            0
  1805-Jefferson      7            0
  1809-Madison        0            0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,427 more features ]

We can see that in the first document, i.e., 1789-Washington, there are 2 occurrences of representatives and 48 occurrences of and.
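
As a quick check, a minimal sketch of looking these cells up directly by indexing the dfm with document and feature names:

corp_us_dfm["1789-Washington", c("representatives", "and")]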

13.9 Feature Selection

A dfm may not be as informative as we expect: not all words reflect the semantics of a document. To better capture the semantic representation of the documents, several factors need to be considered more carefully with respect to the features of the dfm:

  • The granularity of the features
  • The informativeness of the features
  • The distributional properties of the features

In addition to the feature-selection options available when tokenizing the texts (e.g., remove_punct), quanteda provides dfm_remove() and dfm_trim() to select important features for later analysis.

corp_dfm_trimmed <- corp_us %>% 
  tokens(remove_punct = T,
         remove_numbers = T,
         remove_symbols = T) %>%    # drop punctuation, numbers, and symbols
  dfm %>%
  dfm_remove(stopwords("en")) %>%   # drop English stopwords
  dfm_trim(min_termfreq = 10, termfreq_type = "count",
           min_docfreq = 3, max_docfreq = ndoc(corp_us) - 1,
           docfreq_type = "count")  # keep features occurring >= 10 times, in >= 3 but not all documents

dim(corp_us_dfm)
[1]   59 9437
dim(corp_dfm_trimmed)
[1]   59 1402

13.10 Top Features

With a dfm, we can check important features from the corpus.

topfeatures(corp_dfm_trimmed, 10)
    people government         us        can       must       upon      great 
       584        564        505        487        376        371        344 
       may     states      world 
       343        334        319 

We can also extract ngram-based features from each document for analysis:

corp_us_tokens %>%
  tokens_ngrams(n = 3)
Tokens consisting of 59 documents and 4 docvars.
1789-Washington :
 [1] "Fellow-Citizens_of_the"   "of_the_Senate"           
 [3] "the_Senate_and"           "Senate_and_of"           
 [5] "and_of_the"               "of_the_House"            
 [7] "the_House_of"             "House_of_Representatives"
 [9] "of_Representatives_:"     "Representatives_:_Among" 
[11] ":_Among_the"              "Among_the_vicissitudes"  
[ ... and 1,523 more ]

1793-Washington :
 [1] "Fellow_citizens_," "citizens_,_I"      ",_I_am"           
 [4] "I_am_again"        "am_again_called"   "again_called_upon"
 [7] "called_upon_by"    "upon_by_the"       "by_the_voice"     
[10] "the_voice_of"      "voice_of_my"       "of_my_country"    
[ ... and 133 more ]

1797-Adams :
 [1] "When_it_was"         "it_was_first"        "was_first_perceived"
 [4] "first_perceived_,"   "perceived_,_in"      ",_in_early"         
 [7] "in_early_times"      "early_times_,"       "times_,_that"       
[10] ",_that_no"           "that_no_middle"      "no_middle_course"   
[ ... and 2,563 more ]

1801-Jefferson :
 [1] "Friends_and_Fellow"   "and_Fellow_Citizens"  "Fellow_Citizens_:"   
 [4] "Citizens_:_Called"    ":_Called_upon"        "Called_upon_to"      
 [7] "upon_to_undertake"    "to_undertake_the"     "undertake_the_duties"
[10] "the_duties_of"        "duties_of_the"        "of_the_first"        
[ ... and 1,909 more ]

1805-Jefferson :
 [1] "Proceeding_,_fellow"          ",_fellow_citizens"           
 [3] "fellow_citizens_,"            "citizens_,_to"               
 [5] ",_to_that"                    "to_that_qualification"       
 [7] "that_qualification_which"     "qualification_which_the"     
 [9] "which_the_Constitution"       "the_Constitution_requires"   
[11] "Constitution_requires_before" "requires_before_my"          
[ ... and 2,366 more ]

1809-Madison :
 [1] "Unwilling_to_depart"    "to_depart_from"         "depart_from_examples"  
 [4] "from_examples_of"       "examples_of_the"        "of_the_most"           
 [7] "the_most_revered"       "most_revered_authority" "revered_authority_,"   
[10] "authority_,_I"          ",_I_avail"              "I_avail_myself"        
[ ... and 1,247 more ]

[ reached max_ndoc ... 53 more documents ]
corp_us_tokens %>%
  tokens_ngrams(n = 3) %>%
  dfm %>%
  dfm_trim(min_termfreq = 5, termfreq_type = "count",
           min_docfreq = 2, max_docfreq = ndoc(corp_us)-1,
           docfreq_type = "count") -> corp_dfm_trigram
  
topfeatures(corp_dfm_trigram)
          ._it_is the_united_states         ,_and_the         ._we_have 
              156               152               137               101 
    of_the_united     of_the_people of_the_government          ._we_are 
               99                97                79                68 
     of_the_world         ._we_will 
               63                63 

How can we remove irrelevant trigrams (e.g., ones that include punctuation marks)?

the_united_states     of_the_united     of_the_people of_the_government 
              152                99                97                79 
     of_the_world    of_our_country      of_the_union     of_our_people 
               63                49                49                48 
       as_well_as    of_the_country 
               47                47 
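
One possible approach (a sketch, not necessarily the exact method used to produce the list above) is to drop every trigram feature that contains a non-word character, i.e., any punctuation mark other than the underscore connector:

corp_dfm_trigram %>%
  dfm_remove(pattern = "\\W", valuetype = "regex") %>%  # drop features containing punctuation
  topfeatures()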

13.11 Wordclouds

With a dfm, we can visualize important words in the corpus with a word cloud, a simple but intuitive visual representation of text data. It allows us to quickly perceive the most prominent words in a large collection of texts.

corp_dfm_trimmed %>%
  textplot_wordcloud(min_count = 50, random_order = FALSE,
                     rotation = .25,
                     color = RColorBrewer::brewer.pal(8, "Dark2"))

We can also compare word clouds for different subsets of the corpus:

corpus_subset(corp_us,
              President %in% c("Obama", "Trump", "Clinton")) %>%
  tokens(remove_punct = T,
         remove_numbers = T,
         remove_symbols = T) %>%
  tokens_group(groups = President) %>%
  dfm() %>%
  dfm_remove(stopwords("en")) %>%
  dfm_trim(min_termfreq = 5,
           termfreq_type = "count") %>%
  textplot_wordcloud(comparison = TRUE)

13.12 Keyness Analysis

When we have two collections of texts, we can use quantitative methods to identify which words are more strongly associated with one of the two sub-corpora. This is the idea of keyness (or keyword) analysis.
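
The quanteda.textstats package provides textstat_keyness() for this purpose, and quanteda.textplots provides textplot_keyness() for visualizing the result. Below is a minimal sketch, assuming we compare the Obama addresses (reference) against the Trump address (target); any other grouping of the corpus would work the same way.

## group the documents of the two presidents and build a dfm
corp_keyness_dfm <- corpus_subset(corp_us, President %in% c("Obama", "Trump")) %>%
  tokens(remove_punct = T) %>%
  tokens_remove(stopwords("en")) %>%
  tokens_group(groups = President) %>%
  dfm()

## words more strongly associated with Trump (target) than with Obama (reference)
result_keyness <- textstat_keyness(corp_keyness_dfm, target = "Trump")
textplot_keyness(result_keyness)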

13.13 Flowchart

Finally, Figure 13.1 below provides a summary flowchart for computational text analytics in R.

Figure 13.1: Computational Text Processing Flowchart in R

13.14 Exercises

In the following exercises, please use the dataset demo_data/TW_President.tar.gz, which is a text collection of the inaugural speeches of Taiwanese presidents (up to 2016).

You may load the entire text collection as a corpus object using the following code:

require(readtext)
require(quanteda)
corp_tw <- readtext("demo_data/TW_President.tar.gz") %>%
  corpus

Exercise 13.3 Please create a data frame, which includes the metadata information for each text. You may start with a data frame you get from summary(corp_tw) and then create two additional columns—President and Year, which can be extracted from the text filenames in the Text column.

Hint: tidyr::extract()

summary(corp_tw) %>% as_tibble

After you create the metadata data frame, please assign it to docvars(corp_tw) for later analysis.
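
A minimal sketch of this assignment, assuming corp_tw_meta is the name of the metadata data frame you created above (with the President and Year columns):

## attach the metadata as document-level variables of the corpus
## (`corp_tw_meta` is a hypothetical name for your own data frame)
docvars(corp_tw) <- corp_tw_meta %>% select(President, Year)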

Exercise 13.4 Please create a lexical dispersion plot for the use of “台灣” in all presidents’ texts.

Exercise 13.5 Please create a word cloud of the entire corpus corp_tw. In the word cloud, please remove punctuation, numbers, and symbols, and include only words whose frequency is >= 20.

Exercise 13.6 Create word clouds comparing President Tsai (CAYANGWEN), President Ma (MAYANGJIU), and President Chen Shui-bian (CHENSHUIBIAN).

Exercise 13.7 Please create a keyness plot showing the preferred words used by President Tsai (CAYANGWEN) vs. President Ma (MAYANGJIU).