Environment Setup

R and RStudio

In this workshop, I expect that all participants have installed a working R environment and the RStudio IDE on their workstations.

Libraries

There are many useful and effective R libraries for text processing. I highly recommend the official CRAN Task Views webpage, which categorizes R libraries by theme.

Of particular relevance to this workshop is the CRAN Task View on Natural Language Processing.

Specifically, in this workshop, I will use the following libraries. Based on my experience with data processing, I find these libraries effective and easy to learn.

  • tidyverse: This library includes a collection of useful R libraries for data analysis.
  • quanteda: This library provides many useful functions for exploratory quantitative text analysis.
  • tidytext: This library provides many useful functions for computational text analysis under the tidyverse framework.
  • jiebaR: This library provides functions for Chinese tokenization (i.e., word segmentation).
  • readtext: This library provides functions for effectively importing text files into R.
  • quanteda.textplots and quanteda.textstats: These companion libraries to quanteda provide its plotting functions and text statistics functions, respectively.
  • wordcloud2: This library provides functions for creating interactive word clouds.

These libraries can be easily installed in R using the following code:

## Install these libraries if you haven't already
install.packages("tidyverse")
install.packages("quanteda")
install.packages("quanteda.textplots")
install.packages("quanteda.textstats")
install.packages("readtext")
install.packages("tidytext")
install.packages("jiebaR")
install.packages("wordcloud2")
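
If some of these libraries are already installed, reinstalling all of them is unnecessary. The following is a minimal sketch, not part of the original notes, that installs only the missing ones. The variable names pkgs and missing_pkgs are illustrative.

```r
## Libraries used in this workshop
pkgs <- c("tidyverse", "quanteda", "quanteda.textplots",
          "quanteda.textstats", "readtext", "tidytext",
          "jiebaR", "wordcloud2")

## Keep only the names that are not among the installed packages
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
missing_pkgs

## Call install.packages() only when something is missing
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

If missing_pkgs prints character(0), every library is already installed and nothing is downloaded.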

Environment Checking

The R version used to produce these lecture notes: R version 4.1.2 (2021-11-01).
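
To see how your own R version compares with the one stated here, something like the following sketch may help. It is not part of the original notes. It uses the base R function getRversion(), whose result can be compared directly with a version string.

```r
## Show the R version of the current session
R.version.string

## Compare the running version with the one used in these notes
my_version <- getRversion()
if (my_version < "4.1.2") {
  message("Your R (", as.character(my_version),
          ") is older than 4.1.2; consider updating.")
} else {
  message("Your R (", as.character(my_version),
          ") is at least 4.1.2.")
}
```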

After you install the libraries, you can load them into the current R session for later use:

## Importing libraries
library(tidyverse)
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(readtext)
library(tidytext)
library(jiebaR)
library(wordcloud2)


## Checking library versions
packageVersion("tidyverse")
[1] '1.3.1'
packageVersion("quanteda")
[1] '3.2.1'
packageVersion("quanteda.textplots")
[1] '0.94.1'
packageVersion("quanteda.textstats")
[1] '0.95'
packageVersion("tidytext")
[1] '0.3.2'
packageVersion("jiebaR")
[1] '0.11'
packageVersion("wordcloud2")
[1] '0.2.2'
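
Rather than comparing each version by eye, the check can be automated. The sketch below is an illustration, not part of the original notes. It copies the version numbers from the output above into a named vector (notes_versions is an illustrative name) and reports, for each library, whether the installed version is at least as recent. A library that is not installed yields NA.

```r
## Versions used in these notes (copied from the output above)
notes_versions <- c(
  tidyverse          = "1.3.1",
  quanteda           = "3.2.1",
  quanteda.textplots = "0.94.1",
  quanteda.textstats = "0.95",
  tidytext           = "0.3.2",
  jiebaR             = "0.11",
  wordcloud2         = "0.2.2"
)

## TRUE  = installed version is the same or newer
## FALSE = installed version is older
## NA    = library is not installed
up_to_date <- sapply(names(notes_versions), function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) return(NA)
  packageVersion(pkg) >= notes_versions[[pkg]]
})
up_to_date
```

A newer version than the one in the notes is usually fine, although function defaults or outputs may differ slightly from what is shown here.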

Alternatively, we can run the function sessionInfo() to get an overview of the current R environment:

## Check the session info
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] wordcloud2_0.2.2          jiebaR_0.11              
 [3] jiebaRD_0.1               tidytext_0.3.2           
 [5] readtext_0.81             quanteda.textstats_0.95  
 [7] quanteda.textplots_0.94.1 quanteda_3.2.1           
 [9] forcats_0.5.1             stringr_1.4.0            
[11] dplyr_1.0.7               purrr_0.3.4              
[13] readr_2.1.1               tidyr_1.1.4              
[15] tibble_3.1.6              ggplot2_3.3.5            
[17] tidyverse_1.3.1           showtext_0.9-4           
[19] showtextdb_3.0            sysfonts_0.8.5           

loaded via a namespace (and not attached):
 [1] fs_1.5.2           lubridate_1.8.0    httr_1.4.2         SnowballC_0.7.0   
 [5] tools_4.1.2        backports_1.4.1    bslib_0.3.1        utf8_1.2.2        
 [9] R6_2.5.1           DBI_1.1.2          colorspace_2.0-2   withr_2.4.3       
[13] tidyselect_1.1.1   compiler_4.1.2     cli_3.1.0          rvest_1.0.2       
[17] xml2_1.3.3         bookdown_0.24      sass_0.4.0         scales_1.1.1      
[21] digest_0.6.29      rmarkdown_2.11     pkgconfig_2.0.3    htmltools_0.5.2   
[25] dbplyr_2.1.1       fastmap_1.1.0      highr_0.9          htmlwidgets_1.5.4 
[29] rlang_0.4.12       readxl_1.3.1       rstudioapi_0.13    jquerylib_0.1.4   
[33] generics_0.1.1     jsonlite_1.7.2     tokenizers_0.2.1   magrittr_2.0.1    
[37] Matrix_1.4-0       Rcpp_1.0.8.3       munsell_0.5.0      fansi_0.5.0       
[41] lifecycle_1.0.1    stringi_1.7.6      yaml_2.2.1         grid_4.1.2        
[45] crayon_1.4.2       lattice_0.20-45    haven_2.4.3        hms_1.1.1         
[49] knitr_1.37         pillar_1.6.4       stopwords_2.3      fastmatch_1.1-3   
[53] reprex_2.0.1       glue_1.6.0         evaluate_0.14      data.table_1.14.2 
[57] RcppParallel_5.1.4 modelr_0.1.8       vctrs_0.3.8        tzdb_0.2.0        
[61] cellranger_1.1.0   gtable_0.3.0       assertthat_0.2.1   xfun_0.29         
[65] broom_0.7.11       janeaustenr_0.1.5  nsyllable_1.0      ellipsis_0.3.2    
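
The full output above is long. Since sessionInfo() returns a list, individual pieces can be pulled out when only part of the information is needed. The following is a small sketch, not part of the original notes, and the variable name si is illustrative.

```r
## Store the session info so its components can be inspected
si <- sessionInfo()

si$R.version$version.string   # R version string
si$running                    # operating system
names(si$otherPkgs)           # attached libraries
names(si$loadedOnly)          # loaded via a namespace but not attached
```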

If your R version is older than the one above, please consider updating R. Details about updating R can be found in: