Environment Setup
R and RStudio
In this workshop, I expect that all participants have installed a working R environment and the RStudio IDE on their workstations.
Libraries
There are many useful and effective R libraries for text processing. I would highly recommend the official webpage, CRAN Task Views, which provides a theme-based categorization of the R libraries.
Of particular relevance to this workshop is the CRAN Task View: Natural Language Processing.
Specifically, I will use the following libraries in this workshop. In my experience with data processing, these libraries are effective and easy to learn.
tidyverse
: This library includes a collection of useful R libraries for data analysis.
quanteda
: This library provides many useful functions for exploratory quantitative text analysis.
tidytext
: This library provides many useful functions for computational text analysis under the tidyverse framework.
jiebaR
: This library provides functions for Chinese tokenization (i.e., word segmentation).
readtext
: This library provides functions for effectively importing text files into R.
These libraries can be installed in R with the following code:
## Install these libraries if you haven't
install.packages("tidyverse")
install.packages("quanteda")
install.packages("quanteda.textplots")
install.packages("quanteda.textstats")
install.packages("readtext")
install.packages("tidytext")
install.packages("jiebaR")
install.packages("wordcloud2")
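If some of these libraries are already installed, reinstalling all of them is unnecessary. The following is a minimal sketch that installs only the missing ones. The helper name `find_missing` is my own and not part of any library, and the `interactive()` guard is an assumption that you are working in an interactive session such as the RStudio console.

```r
## Install only the libraries that are not yet installed
## (find_missing is a hypothetical helper name used for illustration)
pkgs <- c("tidyverse", "quanteda", "quanteda.textplots",
          "quanteda.textstats", "readtext", "tidytext",
          "jiebaR", "wordcloud2")

find_missing <- function(pkgs) {
  ## installed.packages() returns a matrix whose row names are package names
  setdiff(pkgs, rownames(installed.packages()))
}

missing_pkgs <- find_missing(pkgs)

## Only install from an interactive session (e.g., the RStudio console)
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Running `find_missing(pkgs)` again after the installation should return an empty character vector.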
Environment Checking
These lecture notes were produced with R version 4.1.2 (2021-11-01).
After installing the libraries, you can load them into the current R session for later use:
## Importing libraries
library(tidyverse)
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(readtext)
library(tidytext)
library(jiebaR)
library(wordcloud2)
## Checking library versions
packageVersion("tidyverse")
[1] '1.3.1'
packageVersion("quanteda")
[1] '3.2.1'
packageVersion("quanteda.textplots")
[1] '0.94.1'
packageVersion("quanteda.textstats")
[1] '0.95'
packageVersion("tidytext")
[1] '0.3.2'
packageVersion("jiebaR")
[1] '0.11'
packageVersion("wordcloud2")
[1] '0.2.2'
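The same version check can be done in one step. The sketch below collects the versions of the workshop libraries into a small table, skipping any library that is not installed. The object names (`installed`, `version_table`) are my own choices for illustration.

```r
## Collect the versions of the workshop libraries into one table
pkgs <- c("tidyverse", "quanteda", "quanteda.textplots",
          "quanteda.textstats", "readtext", "tidytext",
          "jiebaR", "wordcloud2")

## Keep only the libraries that are actually installed
installed <- pkgs[pkgs %in% rownames(installed.packages())]

## packageVersion() returns a version object; convert it to a string
versions <- vapply(installed,
                   function(p) as.character(packageVersion(p)),
                   character(1))

version_table <- data.frame(package = installed,
                            version = unname(versions))
version_table
```

Any library in `pkgs` that is missing from `version_table` still needs to be installed.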
Alternatively, we can run the function sessionInfo() to get an overview of the current R environment:
## Check the session info
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] wordcloud2_0.2.2 jiebaR_0.11
[3] jiebaRD_0.1 tidytext_0.3.2
[5] readtext_0.81 quanteda.textstats_0.95
[7] quanteda.textplots_0.94.1 quanteda_3.2.1
[9] forcats_0.5.1 stringr_1.4.0
[11] dplyr_1.0.7 purrr_0.3.4
[13] readr_2.1.1 tidyr_1.1.4
[15] tibble_3.1.6 ggplot2_3.3.5
[17] tidyverse_1.3.1 showtext_0.9-4
[19] showtextdb_3.0 sysfonts_0.8.5
loaded via a namespace (and not attached):
[1] fs_1.5.2 lubridate_1.8.0 httr_1.4.2 SnowballC_0.7.0
[5] tools_4.1.2 backports_1.4.1 bslib_0.3.1 utf8_1.2.2
[9] R6_2.5.1 DBI_1.1.2 colorspace_2.0-2 withr_2.4.3
[13] tidyselect_1.1.1 compiler_4.1.2 cli_3.1.0 rvest_1.0.2
[17] xml2_1.3.3 bookdown_0.24 sass_0.4.0 scales_1.1.1
[21] digest_0.6.29 rmarkdown_2.11 pkgconfig_2.0.3 htmltools_0.5.2
[25] dbplyr_2.1.1 fastmap_1.1.0 highr_0.9 htmlwidgets_1.5.4
[29] rlang_0.4.12 readxl_1.3.1 rstudioapi_0.13 jquerylib_0.1.4
[33] generics_0.1.1 jsonlite_1.7.2 tokenizers_0.2.1 magrittr_2.0.1
[37] Matrix_1.4-0 Rcpp_1.0.8.3 munsell_0.5.0 fansi_0.5.0
[41] lifecycle_1.0.1 stringi_1.7.6 yaml_2.2.1 grid_4.1.2
[45] crayon_1.4.2 lattice_0.20-45 haven_2.4.3 hms_1.1.1
[49] knitr_1.37 pillar_1.6.4 stopwords_2.3 fastmatch_1.1-3
[53] reprex_2.0.1 glue_1.6.0 evaluate_0.14 data.table_1.14.2
[57] RcppParallel_5.1.4 modelr_0.1.8 vctrs_0.3.8 tzdb_0.2.0
[61] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 xfun_0.29
[65] broom_0.7.11 janeaustenr_0.1.5 nsyllable_1.0 ellipsis_0.3.2
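The first line of this output reports the R version. A quick way to compare your own R version against the one used for these notes is sketched below. The variable names are my own, and the threshold is simply the version reported above.

```r
## Compare the running R version with the one used for these notes
required_version <- "4.1.2"
current_version <- getRversion()

## Version objects can be compared directly with a version string
is_outdated <- current_version < required_version

if (is_outdated) {
  message("Your R version (", as.character(current_version),
          ") is older than ", required_version, ".")
} else {
  message("Your R version (", as.character(current_version),
          ") is at least ", required_version, ".")
}
```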
If your R version is older than the one shown above, please consider updating R. Details about updating R can be found in: