Loading Text Data
Text Data Types
Text data is generally stored on the hard drive in one of three formats:
- CSV: demo_data/song-jay-amei-v1.csv
- Directory: demo_data/corp_dir
- A zipped file: demo_data/song-jay-amei-v2.tar.gz
In this workshop, I will show you how to load data in each of these three formats into R.
Data Description
In this workshop, we will use the song lyrics data set for illustration. It is a text collection of song lyrics from two artists: Jay Chou (周杰倫) and Amei Chang (張惠妹).
CSV Format
For small data sets, people often store their textual data as a spreadsheet-like table. We can easily load this type of data using read_csv() (or read_tsv() for tab-separated files) from the readr package.
## CSV Format
library(readr)
corp <- read_csv(file = "demo_data/song-jay-amei-v1.csv",  ## filename
                 locale = locale(encoding = "UTF-8"))      ## encoding
corp
If you see garbled characters in the data frame, the likely cause is an encoding mismatch. Please see Encoding Issues.
Directory Format
When a corpus consists of many text documents, we may store all of them in one directory. With a data structure like this, we can load the entire corpus directory using readtext() from the readtext package:
## Directory Format
library(readtext)
corp <- readtext(file = "demo_data/corp_dir",
                 docvarsfrom = "filenames",           ## parsing filenames
                 dvsep = "-",                         ## delimiter
                 docvarnames = c("artist", "index"),  ## doc-level info
                 encoding = "UTF-8")                  ## encoding
corp
Very often the filenames of the text files are meaningful when we create the corpus. That is, we often encode document-level metadata in the filenames.
For example, the text filename Amei-1.txt indicates both the artist's name and the document's unique integer index.
When we load the text collection, readtext() can parse these pieces of metadata from the filenames.
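To make the filename convention concrete, here is a minimal base-R sketch of the kind of parsing that docvarsfrom = "filenames" with dvsep = "-" asks for. It is not how readtext() is implemented internally; the filenames other than Amei-1.txt are hypothetical, and the column names simply mirror the docvarnames used above.

```r
## Hypothetical filenames following the "artist-index.txt" pattern
filenames <- c("Amei-1.txt", "Amei-2.txt", "Jay-1.txt")

## Drop the file extension, then split each stem on the delimiter "-"
stems <- tools::file_path_sans_ext(filenames)
parts <- strsplit(stems, split = "-", fixed = TRUE)

## Collect the pieces into document-level variables
docvars <- data.frame(
  doc_id = filenames,
  artist = vapply(parts, `[`, character(1), 1),
  index  = as.integer(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
docvars
```

A delimiter that also appears inside an artist's name would break this simple split, so it helps to choose a delimiter that never occurs in the metadata values themselves.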
Zipped File Format
When the corpus is large, people may compress the entire corpus into a single archive file (here, a .tar.gz file) to save storage space on the hard drive. To load a data set like this, we can also use readtext().
## Zipped Format
library(readtext)
corp <- readtext(file = "demo_data/song-jay-amei-v2.tar.gz",
                 docvarsfrom = "filenames",
                 dvsep = "-",
                 docvarnames = c("artist", "index"),
                 encoding = "UTF-8")
corp
The function readtext() works on:
- text (.txt) files;
- comma-separated-value (.csv) files;
- XML formatted data;
- data from the Facebook API, in JSON format;
- data from the Twitter API, in JSON format; and
- generic JSON data.
Encoding Issues
In computational text processing, the default encoding of text data is usually UTF-8. However, not all operating systems take UTF-8 as the default encoding for text file management.
This has a lot to do with the operating system locale. In R, we can check the system locale with:
Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
For Windows users, if your locale is English, you may see:
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
If your locale is Chinese, you may see:
[1] "LC_COLLATE=Chinese (Traditional)_Taiwan.950;LC_CTYPE=Chinese (Traditional)_Taiwan.950;LC_MONETARY=Chinese (Traditional)_Taiwan.950;LC_NUMERIC=C;LC_TIME=Chinese (Traditional)_Taiwan.950"
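Reading the full locale string by eye can be error-prone. As a small sketch, base R's l10n_info() reports whether the current session runs in a UTF-8 locale, and Sys.getlocale("LC_CTYPE") returns only the character-type category. The variable names below are illustrative.

```r
## Ask R about the character set of the current locale
info <- l10n_info()

## TRUE when the session locale is UTF-8, FALSE otherwise
is_utf8 <- isTRUE(info[["UTF-8"]])

## The locale category that governs character encoding
ctype <- Sys.getlocale("LC_CTYPE")

message("LC_CTYPE: ", ctype)
message("UTF-8 session: ", is_utf8)
```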
For Mac, the default encoding is UTF-8.
For Windows, things can get more complicated. In general, the default encoding in Windows is locale-dependent, i.e., it depends on the OS language setting: for example, Big5 (code page 950) for Chinese-Taiwan and code page 1252 for English-United States.
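The following sketch illustrates why the encoding matters: the same three characters are stored as different byte sequences in UTF-8 and in Big5, so a file written in one encoding and read as the other comes out garbled. It uses only base R's iconv(); the encoding name "BIG5" is assumed to be supported by the iconv implementation on your platform (iconvlist() shows what is available).

```r
## The name 周杰倫, written with Unicode escapes so that the script
## itself does not depend on the encoding of the source file
x <- "\u5468\u6770\u502b"

## Raw bytes of the string in UTF-8 and in Big5
utf8_bytes <- iconv(x, from = "UTF-8", to = "UTF-8", toRaw = TRUE)[[1]]
big5_bytes <- iconv(x, from = "UTF-8", to = "BIG5", toRaw = TRUE)[[1]]

length(utf8_bytes)  ## three bytes per character in UTF-8
length(big5_bytes)  ## two bytes per character in Big5

## Decoding the Big5 bytes with the correct encoding restores the text
back <- iconv(list(big5_bytes), from = "BIG5", to = "UTF-8")
identical(back, x)
```

Declaring the encoding explicitly when loading data, as in the read_csv() and readtext() calls earlier, avoids relying on whatever the operating system assumes by default.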
In Windows, we can change the locale in R as below:
## from Chinese to English
Sys.setlocale(category = "LC_ALL", locale = "UTF-8")
## from English to traditional Chinese
Sys.setlocale(category = "LC_ALL", locale = "cht")
## or to simplified Chinese
Sys.setlocale(category = "LC_ALL", locale = "chs")
Windows users who want to process Chinese data with R should set the locale to Chinese.