Processing Chinese Data with R
April 26, 2022, @ HKSYU (Online Workshop)
Introduction
The theme of this workshop is Chinese text processing with R. I will introduce a few R libraries that are useful for text analytic tasks. I will also focus on one important step in Chinese text processing, namely word segmentation (tokenization), and show how to handle this step in the processing pipeline. Finally, I will demonstrate a few potential applications in data analysis and visualization, using informative graphs created with R.
Structure of the Workshop
- Environment Setup
- Loading Text Data
- Chinese Word Segmentation
- Applications
Data
All the data sets and the R scripts used in this workshop can be downloaded here as a zipped file: WorkshopHKSYU.zip. (Click the Download button to download everything as a zipped file.)
After downloading, unzip the file and you will see a new directory, WorkshopHKSYU, in the directory where you saved the zipped file.
In this WorkshopHKSYU directory, you will find all the R script files (WorkshopHKSYU/*.R) and a sub-directory (WorkshopHKSYU/demo_data) required for this workshop.
Please double-click one of the R script files to open it in RStudio/R.
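As an alternative to unzipping by hand, the same step can be done from within R. The following is only a minimal sketch: it assumes that the zipped file was saved in the current working directory under the name WorkshopHKSYU.zip (the variable names zip_path and workshop_dir are illustrative), and it uses only base R functions.

```r
# Assumption: the zipped file sits in the current working directory.
# Adjust `zip_path` if you saved it somewhere else.
zip_path <- "WorkshopHKSYU.zip"
workshop_dir <- "WorkshopHKSYU"

if (file.exists(zip_path)) {
  # Extract the archive next to the zipped file
  unzip(zip_path, exdir = dirname(zip_path))

  # List the R scripts in the workshop directory
  print(list.files(workshop_dir, pattern = "\\.R$"))

  # List the data files in the demo_data sub-directory
  print(list.files(file.path(workshop_dir, "demo_data")))
} else {
  message("Zipped file not found: ", zip_path)
}
```

If the file is not found, check the current working directory with getwd() and either move the zipped file there or change zip_path accordingly.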
Materials
In the workshop notes, the text boxes in light blue contain the R code that you need to run in the R console. The text boxes with a black background show the output of that code.
We will follow this presentation convention throughout the notes.
print('Hello! R!')
[1] "Hello! R!"
Resources
- R Fundamentals (Wickham and Grolemund 2017; Davies 2016)
- Corpus Processing (Gries 2016; Stefanowitsch 2019)
- Statistics (Gries 2021; Baayen 2008; Brezina 2018; Winter 2020)