Introduction

The theme of this workshop is Chinese text processing with R. I will introduce a few useful R libraries for text analytics. I will also focus on an important step in Chinese text processing, namely word segmentation (tokenization), and show how to handle this step in the pipeline. Finally, I will demonstrate a few potential applications of data analysis and visualization using informative graphs in R.

Structure of the Workshop

Environment Setup
Loading Text Data
Chinese Word Segmentation
Applications

Data

All the data sets and R scripts used in this workshop can be downloaded as a single zipped file: WorkshopHKSYU.zip. (Click the Download button to download everything at once.)

After downloading, unzip the file and you will see a new directory, WorkshopHKSYU, in the directory where you saved the zipped file.

In this WorkshopHKSYU directory, you will find all the R script files (WorkshopHKSYU/*.R) and a sub-directory (WorkshopHKSYU/demo_data) required for this workshop.
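If you prefer to unzip and inspect the materials from the R console, the steps above can be sketched roughly as follows. This is only a minimal sketch: it assumes that WorkshopHKSYU.zip sits in your current working directory and that it unpacks into a top-level WorkshopHKSYU folder, and the helper name check_workshop_dir() is made up for this illustration.

```r
# Hypothetical helper (not part of the workshop scripts): list the R scripts
# in an unzipped workshop directory and report whether demo_data is present.
check_workshop_dir <- function(dir = "WorkshopHKSYU") {
  if (!dir.exists(dir)) {
    stop("Directory not found: ", dir, ". Unzip WorkshopHKSYU.zip first.")
  }
  list(
    scripts = list.files(dir, pattern = "\\.R$"),             # *.R files only
    has_demo_data = dir.exists(file.path(dir, "demo_data"))   # data sub-directory
  )
}

# Assumption: the zipped file is in the current working directory.
if (file.exists("WorkshopHKSYU.zip")) {
  unzip("WorkshopHKSYU.zip")  # extracts into the working directory
}

# Only inspect the directory if it actually exists.
if (dir.exists("WorkshopHKSYU")) {
  print(check_workshop_dir("WorkshopHKSYU"))
}
```

If the zipped file is saved somewhere else, call setwd() with that location first, or pass full paths to unzip() and check_workshop_dir().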

Please double-click one of the R script files to open it in RStudio (or R).

Materials

In the workshop notes, the text boxes in light blue contain the R code that you need to run in the R console. The text boxes with a black background show the output of that code.

We will follow this presentation convention throughout these notes.

print('Hello! R!')

[1] "Hello! R!"

Resources

R Fundamentals (Wickham and Grolemund 2017; Davies 2016)

Corpus Processing (Gries 2016; Stefanowitsch 2019)

Statistics (Gries 2021; Baayen 2008; Brezina 2018; Winter 2020)

References

Baayen, R. H. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge University Press.

Brezina, Vaclav. 2018. Statistics in Corpus Linguistics: A Practical Guide. Cambridge University Press.

Davies, Tilman M. 2016. The Book of R: A First Course in Programming and Statistics. 1st ed. San Francisco, US: No Starch Press, Inc.

Gries, Stefan T. 2016. Quantitative Corpus Linguistics with R: A Practical Introduction. 2nd ed. Routledge.

———. 2021. Statistics for Linguistics with R: A Practical Introduction. 3rd ed. Walter de Gruyter.

Stefanowitsch, Anatol. 2019. Corpus Linguistics: A Guide to the Methodology. Language Science Press. http://langsci-press.org/catalog/book/148.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.

Winter, Bodo. 2020. Statistics for Linguists: An Introduction Using R. Routledge.

Processing Chinese Data with R