Loading Text Data

Text Data Types

Text data is commonly stored on disk in one of three formats:

  • CSV: demo_data/song-jay-amei-v1.csv
  • Directory: demo_data/corp_dir
  • A compressed archive: demo_data/song-jay-amei-v2.tar.gz

In this workshop, I will show you how to load data in each of these three formats into R.

Data Description

In this workshop, we will use the song lyrics data set for illustration. It is a text collection of song lyrics from two artists: Jay Chou (周杰倫) and Amei Chang (張惠妹).

CSV Format

For small data sets, people often store their textual data as a spreadsheet-like table. We can load this type of data with read_csv() (or read_tsv() for tab-separated files) from the readr package.

## CSV Format
library(readr) ## provides read_csv() and locale()
corp <- read_csv(file = "demo_data/song-jay-amei-v1.csv", ## filename
                 locale = locale(encoding = "UTF-8")) ## encoding
corp

If you see garbled characters in the data frame, the file was probably read with the wrong encoding. Please see Encoding Issues.

Directory Format

When a corpus contains many text documents, we often keep all of them in one directory. With this structure, we can load the entire corpus directory using readtext():

## Directory Format
library(readtext) ## provides readtext()
corp <- readtext(file = "demo_data/corp_dir",
                 docvarsfrom = "filenames",  ## parsing filenames
                 dvsep = "-", ## delimiter
                 docvarnames = c("artist", "index"), ## doc-level info
                 encoding = "UTF-8") ## encoding
corp

The filenames of the text files are often meaningful: when we create a corpus, we often encode metadata about each document in its filename.

For example, the filename Amei-1.txt indicates both the artist’s name and the document’s unique integer index.

When we load the text collection, readtext() can parse these pieces of metadata from the filenames (docvarsfrom = "filenames"), split them on the delimiter (dvsep), and store them as document-level variables (docvarnames).
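To make the filename convention concrete, here is a minimal base R sketch of the same idea. It is not how readtext() is implemented internally; the filenames and the docvars data frame are hypothetical and serve only to illustrate splitting on the "-" delimiter.

```r
## Hypothetical filenames following the "artist-index.txt" pattern
filenames <- c("Amei-1.txt", "Amei-2.txt", "Jay-1.txt")

## Drop the extension, then split each name on the "-" delimiter
stems <- tools::file_path_sans_ext(basename(filenames))
parts <- strsplit(stems, split = "-", fixed = TRUE)

## Assemble the pieces into document-level variables
docvars <- data.frame(
  doc_id = filenames,
  artist = vapply(parts, `[`, character(1), 1),
  index  = as.integer(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
docvars
```

The sketch assumes that every filename contains exactly one delimiter. A filename with extra hyphens would need a different delimiter or a more careful split.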

Zipped File Format

When the corpus is large, people may compress the entire corpus into a single archive to save storage space on the hard drive. To load a data set like this, we can also use readtext().

## Zipped Format
corp <- readtext(file = "demo_data/song-jay-amei-v2.tar.gz",
                 docvarsfrom = "filenames",  ## parsing filenames
                 dvsep = "-", ## delimiter
                 docvarnames = c("artist", "index"), ## doc-level info
                 encoding = "UTF-8") ## encoding
corp
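Before loading an archive, it can help to check what it contains. The following sketch uses only base R (utils::tar() and utils::untar()) and builds a small throwaway archive in a temporary directory so that it runs on its own. The directory and file names are made up for illustration and are not part of the workshop data.

```r
## Build a tiny throwaway corpus directory so the example is self-contained
tmp_dir <- tempfile("demo_")
dir.create(file.path(tmp_dir, "corp_dir"), recursive = TRUE)
writeLines("first lyric", file.path(tmp_dir, "corp_dir", "Amei-1.txt"))
writeLines("second lyric", file.path(tmp_dir, "corp_dir", "Jay-1.txt"))

## Compress the directory into a .tar.gz archive
old_wd <- setwd(tmp_dir)
utils::tar("corp.tar.gz", files = "corp_dir",
           compression = "gzip", tar = "internal")
setwd(old_wd)
archive <- file.path(tmp_dir, "corp.tar.gz")

## List the archive members without extracting them
members <- utils::untar(archive, list = TRUE, tar = "internal")
members

## Extract into a separate directory and read one file back
out_dir <- file.path(tmp_dir, "unpacked")
utils::untar(archive, exdir = out_dir, tar = "internal")
lyric <- readLines(file.path(out_dir, "corp_dir", "Amei-1.txt"))
lyric
```

With readtext() this manual step is not required, since the function accepts the archive path directly as shown above. Listing the members is mainly useful for confirming that the filenames follow the expected pattern.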

The function readtext() works on:

  • text (.txt) files;
  • comma-separated-value (.csv) files;
  • XML formatted data;
  • data from the Facebook API, in JSON format;
  • data from the Twitter API, in JSON format; and
  • generic JSON data.

Encoding Issues

In computational text processing, the default encoding of text data is usually UTF-8. However, not all operating systems take UTF-8 as the default encoding for text file management.

This has a lot to do with the operating system locale. In R, we can check the system locale with:

Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

For Windows users, if your locale is English, you may see:

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

If your locale is Chinese, you may see:

[1] "LC_COLLATE=Chinese (Traditional)_Taiwan.950;LC_CTYPE=Chinese (Traditional)_Taiwan.950;LC_MONETARY=Chinese (Traditional)_Taiwan.950;LC_NUMERIC=C;LC_TIME=Chinese (Traditional)_Taiwan.950"
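Rather than reading the full locale string by eye, we can ask R directly whether the current session uses UTF-8. The sketch below relies on two base R functions, Sys.getlocale() and l10n_info(). The variable names are illustrative.

```r
## Locale category that governs character encoding
ctype <- Sys.getlocale("LC_CTYPE")
ctype

## l10n_info() reports whether the session is running in a UTF-8 locale
is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])
is_utf8
```

If is_utf8 is FALSE, text files saved as UTF-8 are more likely to need an explicit encoding argument when they are loaded.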

On macOS, the default encoding is UTF-8.

On Windows, things are more complicated. The default encoding depends on the locale, that is, on the OS language setting: for example, code page 950 (Big5) for Traditional Chinese (Taiwan) and code page 1252 for English (United States).
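To see why the encoding matters, consider how the same characters are stored under two encodings. The following base R sketch converts the name 周杰倫 between UTF-8 and Big5 with iconv(). It assumes that the platform's iconv supports the encoding name "BIG5", which is common but not guaranteed.

```r
## "周杰倫" written with Unicode escapes so the source file stays ASCII
name_utf8 <- "\u5468\u6770\u502b"

## Raw bytes of the same text under each encoding
bytes_utf8 <- charToRaw(name_utf8)
bytes_big5 <- iconv(name_utf8, from = "UTF-8", to = "BIG5", toRaw = TRUE)[[1]]

length(bytes_utf8)  ## UTF-8 uses 3 bytes for each of these characters
length(bytes_big5)  ## Big5 uses 2 bytes for each of these characters

## Converting the Big5 bytes back to UTF-8 recovers the original text
restored <- iconv(list(bytes_big5), from = "BIG5", to = "UTF-8")
restored
```

Because the byte sequences differ, a file written in one encoding and read as the other produces garbled characters. Declaring the encoding when loading the data avoids this.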

On Windows, we can change the locale in R as follows:

## from Chinese to English
Sys.setlocale(category = "LC_ALL", locale = "English")

## from English to Traditional Chinese
Sys.setlocale(category = "LC_ALL", locale = "cht")

## from English to Simplified Chinese
Sys.setlocale(category = "LC_ALL", locale = "chs")

Windows users who process Chinese data in R should set the locale to Chinese first.