Introduction

This workshop is about Chinese text processing with R. I will introduce a few useful R libraries for text analytics. I will also focus on an important step in Chinese text processing, namely word segmentation (tokenization), and show how to handle this step in the pipeline. Finally, I will demonstrate a few potential applications of data analysis and visualization using attractive, informative graphs in R.

Structure of the Workshop

  • Environment Setup
  • Loading Text Data
  • Chinese Word Segmentation
  • Applications

Data

All the data sets and R scripts used in this workshop can be downloaded here as a zipped file: WorkshopHKSYU.zip. (Click the Download button to download everything as a zipped file.)

After downloading, unzip the file and you will see a new directory, WorkshopHKSYU, in the directory where you saved the zipped file.

In this WorkshopHKSYU directory, you will find all the R script files (WorkshopHKSYU/*.R) and a sub-directory (WorkshopHKSYU/demo_data) required for this workshop.
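
If you prefer to check the contents from the R console, the layout described above can be inspected with a few lines of base R. This is only a minimal sketch: the helper `list_workshop_files()` is a hypothetical name (it is not one of the workshop scripts), and the code assumes that WorkshopHKSYU.zip was saved in your current working directory.

```r
# Hypothetical helper, not part of the workshop scripts:
# list the R scripts and the demo data files under the unzipped directory.
list_workshop_files <- function(root = "WorkshopHKSYU") {
  if (!dir.exists(root)) {
    stop("Directory not found: ", root)
  }
  list(
    # R script files directly under the workshop directory
    scripts   = list.files(root, pattern = "\\.R$"),
    # Files in the demo_data sub-directory
    demo_data = list.files(file.path(root, "demo_data"))
  )
}

# Assumption: the zipped file sits in the current working directory.
# Unzip it only if the directory has not been extracted yet.
if (file.exists("WorkshopHKSYU.zip") && !dir.exists("WorkshopHKSYU")) {
  unzip("WorkshopHKSYU.zip")  # extracts into the working directory
}

# Print the contents only if the workshop directory is actually there.
if (dir.exists("WorkshopHKSYU")) {
  print(list_workshop_files())
}
```

If nothing is printed, the working directory is probably not the one that holds the zipped file; `getwd()` shows where R is currently looking.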

Please double-click an R script file to open it in RStudio/R.

Materials

In the workshop notes, the text boxes with a light blue background contain the R code that you need to run in the R console. The text boxes with a black background show the output of that code.

We will follow this convention throughout the notes.

print('Hello! R!')
[1] "Hello! R!"

Resources

References

Baayen, R. H. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge University Press.
Brezina, Vaclav. 2018. Statistics in Corpus Linguistics: A Practical Guide. Cambridge University Press.
Davies, Tilman M. 2016. The Book of R: A First Course in Programming and Statistics. 1st ed. San Francisco, US: No Starch Press, Inc.
Gries, Stefan T. 2016. Quantitative Corpus Linguistics with R: A Practical Introduction. 2nd ed. Routledge.
Gries, Stefan T. 2021. Statistics for Linguistics with R: A Practical Introduction. 3rd ed. Walter de Gruyter.
Stefanowitsch, Anatol. 2019. Corpus Linguistics: A Guide to the Methodology. Language Science Press. http://langsci-press.org/catalog/book/148.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.
Winter, Bodo. 2020. Statistics for Linguists: An Introduction Using R. Routledge.