Chapter 2 R Fundamentals

2.1 Installing R

Download the installation file: http://cran.r-project.org

2.2 Installing RStudio

After you install R, you can install RStudio. RStudio is an editor that helps you write R code. A good analogy is that R is the engine of a car and RStudio is its dashboard.

Please download the right version that is compatible with your PC operating system.

  • Important notes:
    • Do not use Chinese characters in your directory names or anywhere on the path to your files
    • Do not use spaces or unusual symbols in your file paths. Paths like the following are safe:
      • D:/R
      • D:/Rstudio
      • /User/Alvinchen/

2.3 My Current Version

sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.0    
 [5] purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.1.8   
 [9] ggplot2_3.4.1   tidyverse_2.0.0 reticulate_1.28

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0 xfun_0.37        bslib_0.4.2      lattice_0.20-45 
 [5] colorspace_2.1-0 vctrs_0.5.2      generics_0.1.3   htmltools_0.5.5 
 [9] yaml_2.3.7       utf8_1.2.3       rlang_1.1.0      jquerylib_0.1.4 
[13] pillar_1.8.1     glue_1.6.2       withr_2.5.0      rappdirs_0.3.3  
[17] lifecycle_1.0.3  munsell_0.5.0    gtable_0.3.1     evaluate_0.20   
[21] labeling_0.4.2   knitr_1.42       tzdb_0.3.0       fastmap_1.1.1   
[25] fansi_1.0.4      highr_0.10       Rcpp_1.0.10      scales_1.2.1    
[29] formatR_1.14     cachem_1.0.7     jsonlite_1.8.4   farver_2.1.1    
[33] hms_1.1.2        png_0.1-8        digest_0.6.31    stringi_1.7.12  
[37] bookdown_0.32    icons_0.2.0      grid_4.2.2       cli_3.6.1       
[41] tools_4.2.2      magrittr_2.0.3   sass_0.4.5       pkgconfig_2.0.3 
[45] ellipsis_0.3.2   Matrix_1.5-3     xml2_1.3.3       timechange_0.2.0
[49] rmarkdown_2.20   rstudioapi_0.14  R6_2.5.1         compiler_4.2.2  

2.4 The Interface of RStudio

When you start RStudio, you will see an interface as follows:

Figure 2.1: Rstudio Interface

  • RStudio Interface:
    • Editor: You create and edit R-related files here (e.g., *.r, *.Rmd, etc.)
    • Console: This is the R engine, which runs the code sent from an R script file or typed directly at the console prompt
    • Output: You can view graphic output here

The R console is like a calculator. You can type any R code in the console after the prompt > and run the code line by line by pressing enter.

1 + 1
[1] 2
log(10)
[1] 2.302585
1:5
[1] 1 2 3 4 5

Alternatively, we can create an R script in RStudio and write lines of R code to be passed to the R console. This way, we can run the whole script at once. This is the basic idea of writing a program.

In the example above (Figure 2.1), I wrote a few lines of code in an R script file (cf. the Editor frame) and asked R to run them in the R console. The graphic output of the script was displayed in the Output frame.

Exercise 2.1 Please create a new R script in RStudio. You may name the script “ch-2-NAME.R” (please use your own name). Write the following code in the script and pass the whole script to the R console.

scores <- rnorm(1000, mean = 75, sd = 5.8)
plot(density(scores))
hist(scores)
boxplot(scores)

Exercise 2.2 Find the answer to the following mathematical calculation with R.

\(2^{2+1}-4+64^{(-2)^{2.25-\frac{1}{4}}}\)

= 16777220

2.5 Assignment

R works with objects of many different classes. Some classes are defined in base R, while others are defined by specific libraries, environments, or users.

You can assign any object created in R to a variable name using <-:

x <- 5
y <- "wonderful"

Now the objects are stored in the variables. You can print out a variable either by auto-printing (i.e., typing the variable name by itself prints its content) or by calling print():

x
[1] 5
print(x)
[1] 5
y
[1] "wonderful"
print(y)
[1] "wonderful"
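As a small sketch of how stored objects behave, the following lines reuse x and y from above. The name x_doubled is introduced here only for illustration. class() reports the class of an object, and assigning to an existing name replaces its old content.

```r
x <- 5
y <- "wonderful"

# class() tells you what kind of object a variable holds
class(x)
class(y)

# A stored object can be used in further computations
x_doubled <- x * 2
x_doubled

# Assigning to an existing name overwrites the old value
x <- x + 1
x
```

Because assignment silently overwrites, it is worth printing a variable again whenever you are unsure what it currently holds.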

2.6 Data Structure

In R, the most primitive object is a vector. There are three types of primitive vectors: (a) numeric, (b) character, and (c) Boolean (called logical in R) vectors. In our previous examples, x is a numeric vector of one element, and y is a character vector of one element. The following code shows an example of a Boolean vector z.

z <- TRUE
z
[1] TRUE

All elements in a vector have to be of the same data type.

The vectors we’ve created so far are vectors of only ONE ELEMENT. You use c() to create a vector of multiple elements. Within the parentheses, you separate the elements of the vector with commas:

x2 <- c(1, 2, 3, 4, 5, 6)
x2
[1] 1 2 3 4 5 6
y2 <- c("wonderful", "excellent", "brilliant")
y2
[1] "wonderful" "excellent" "brilliant"
z2 <- c(TRUE, FALSE, TRUE)
z2
[1]  TRUE FALSE  TRUE
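To illustrate the rule that all elements of a vector share one data type, here is a minimal sketch of what happens when you mix types inside c(). The names mixed and num_logical are made up for this example. R does not raise an error; it converts every element to the most general type.

```r
# Mixing numbers, strings, and logical values: everything becomes character
mixed <- c(1, "two", TRUE)
mixed
class(mixed)

# Mixing numbers and logical values: TRUE becomes 1 and FALSE becomes 0
num_logical <- c(1, TRUE, FALSE)
num_logical
class(num_logical)
```

This silent conversion is a common source of surprises, so it helps to check class() after building a vector from mixed input.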

Other data structures that we often work with include:

  • List: a vector-like structure, but can consist of elements of different data types
  • Matrix: a two-dimensional vector, where all elements have to be of the same data type
  • Data Frame: a spreadsheet-like table, where columns can be of different data types
ex_list <- list("First element", 5:10, TRUE)
print(ex_list)
[[1]]
[1] "First element"

[[2]]
[1]  5  6  7  8  9 10

[[3]]
[1] TRUE
ex_array <- matrix(c(1, 5, 6, 3, 8, 19), byrow = TRUE, nrow = 2)
ex_array
     [,1] [,2] [,3]
[1,]    1    5    6
[2,]    3    8   19
ex_df <- data.frame(
  WORD = c("the", "boy", "you","him"),
  POS = c("ART","N","PRO","PRO"),
  FREQ = c(1104,35, 104, 34)
)
ex_df
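Once these objects exist, a natural next step is to pull individual pieces out of them. The following sketch recreates the three objects from above so that it runs on its own. The names second_element, one_cell, freq_column, and pronouns are introduced only for illustration.

```r
ex_list <- list("First element", 5:10, TRUE)
ex_array <- matrix(c(1, 5, 6, 3, 8, 19), byrow = TRUE, nrow = 2)
ex_df <- data.frame(
  WORD = c("the", "boy", "you", "him"),
  POS = c("ART", "N", "PRO", "PRO"),
  FREQ = c(1104, 35, 104, 34)
)

# List: double square brackets extract one element
second_element <- ex_list[[2]]

# Matrix: index by [row, column]
one_cell <- ex_array[2, 3]

# Data frame: $ extracts one column as a vector
freq_column <- ex_df$FREQ

# Data frame: keep only the rows that meet a condition
pronouns <- ex_df[ex_df$POS == "PRO", ]
pronouns

# str() gives a compact summary of the structure of any object
str(ex_df)
```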

The following graph shows you an intuitive understanding of the data structures in R. We will discuss more on data structures in Chapter 4.

2.7 Function

A function is also an object in R, with its own class. Many functions are predefined in the base R libraries.

class(c)
[1] "function"
class(vector)
[1] "function"
class(print)
[1] "function"

To tell R precisely what to do, a function call often specifies several parameters. Take the function matrix() used earlier as an example. It is a predefined function in the R base library.

ex_array <- matrix(c(1, 5, 6, 3, 8, 19), byrow = TRUE, nrow = 2)
ex_array
     [,1] [,2] [,3]
[1,]    1    5    6
[2,]    3    8   19

When creating the matrix, we specify values for the parameters byrow and nrow. These specifications tell R to create a matrix with two rows and to fill in the numbers row by row. The actual values that we pass to the parameters, i.e., TRUE and 2, are referred to as arguments.

A parameter is a variable in the declaration of a function. An argument is the actual value of this variable that gets passed to the function.
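The distinction between parameters and arguments can be sketched with a small function. The function raise_to_power() and its parameter names are hypothetical and exist only for this illustration. It also shows that arguments can be matched by position or by name, and that a parameter with a default value can be left out of the call.

```r
# 'number' and 'power' are parameters; 'power' has a default value of 2
raise_to_power <- function(number, power = 2) {
  number^power
}

# 3 is the argument for 'number'; 'power' falls back to its default
raise_to_power(3)

# Arguments matched by position
raise_to_power(3, 3)

# Arguments matched by name, so their order does not matter
raise_to_power(power = 3, number = 2)

# List the parameter names of a function
names(formals(raise_to_power))
```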


Most importantly, we can define our own function, which is tailored to perform specific tasks. All self-created functions need to be defined first in the R environment before you can call them.

  • Define your own function:
print_out_user_name <- function(name = ""){
  cat("The current username is: ", name, "\n")
}
  • Call your own function:
print_out_user_name(name = "Alvin Cheng-Hsien Chen")
The current username is:  Alvin Cheng-Hsien Chen 
print_out_user_name(name = "Ada Lovelace")
The current username is:  Ada Lovelace 
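The function above only prints text. Many functions instead return a value that can be stored and reused. Here is a minimal sketch of such a function. The name celsius_to_fahrenheit() is made up for this example and is not part of the course materials.

```r
# Convert temperatures from Celsius to Fahrenheit
celsius_to_fahrenheit <- function(celsius = 0) {
  fahrenheit <- celsius * 9 / 5 + 32
  return(fahrenheit)
}

# The returned value can be assigned to a variable
temps <- celsius_to_fahrenheit(celsius = c(0, 37, 100))
temps
```

Because the arithmetic inside the function works on whole vectors, the function accepts a multi-element vector without any extra code.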

Exercise 2.3 Please define a function called make_students_happy(), which takes a multi-element numeric vector and returns a numeric vector in which each element is the square root of the corresponding original value, multiplied by 10.

student_current_scores <- c(20, 34, 60, 87, 100)
make_students_happy(old_scores = student_current_scores)
[1]  44.72136  58.30952  77.45967  93.27379 100.00000

2.8 Vectorization

Many operations in R are vectorized, meaning that an operation is applied to all elements of a vector at once, without an explicit loop. This allows you to write code that is efficient, concise, and easier to read than in non-vectorized languages.

The simplest example of a vectorized operation is adding two vectors together.

x <- 1:4
y <- 6:9 
z <- x + y
z
[1]  7  9 11 13

Without vectorization, you would need to add the two vectors element by element, as follows:

z <- numeric(length = length(x))

for (i in seq_along(x)) {
  z[i] <- x[i] + y[i]
} # endfor

z
[1]  7  9 11 13

Other common vectorized operations include comparisons:

x >= 5
[1] FALSE FALSE FALSE FALSE
x < 2
[1]  TRUE FALSE FALSE FALSE
y == 8
[1] FALSE FALSE  TRUE FALSE
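Vectorization is not limited to arithmetic and comparison operators. As a brief sketch, the following lines apply a few base R functions to whole vectors at once. The variable names on the left-hand side are introduced only for this example.

```r
x <- 1:4
y <- 6:9

# Mathematical functions work on every element
roots <- sqrt(x)
roots

# Element-wise multiplication of two vectors
products <- x * y
products

# A logical vector can be summed to count how many elements meet a condition
n_large <- sum(y > 7)
n_large

# String functions are vectorized as well
word_lengths <- nchar(c("the", "boy", "you", "him"))
word_lengths
```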

For more information on vectorization, please watch the following YouTube clip from Roger Peng.

2.9 Script

In our earlier demonstrations, we ran R code by entering each command line by line. We can instead create a script file in the Editor of RStudio and include all our R code in that file, which usually has the file extension .R. We can then run all the commands in the script at once in RStudio (i.e., send everything in the script file to the R console).

First, open the *.R script file in RStudio; it should appear in the Editor frame. To run the whole script from start to end, select all lines in the script file and press Ctrl/Cmd + Shift + Enter. To run a particular line of the script, place your cursor on that line and press Ctrl/Cmd + Enter.
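A script can also be run from the console with the base R function source(). The sketch below writes a tiny two-line script to a temporary folder and then runs it. The file name ch-2-demo.R and its contents are made up for this illustration; in practice you would pass the path of your own script.

```r
# Create a small script file in a temporary folder (for demonstration only)
script_path <- file.path(tempdir(), "ch-2-demo.R")
writeLines(
  c("scores <- c(70, 75, 80)",
    "mean_score <- mean(scores)"),
  script_path
)

# Run every line of the script; echo = TRUE also prints each command
source(script_path, echo = TRUE)

# Objects created by the script are now available in the environment
mean_score
```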

2.10 Library

R, like other programming languages, comes with a large collection of packages and extensions, which allow us to do many different tasks without writing all the code from scratch. In CRAN Task Views, you can find packages for particular topics/tasks.

To install a package (i.e., library):

install.packages("tidyverse")
install.packages(c("ggplot2", "devtools", "dplyr"))

In this course, I recommend that you install the package tidyverse, which is a bundle of several useful packages for data analysis. During the installation, if you are asked whether to install the package from source, please enter yes (see below for more detail).

During the R package installation, if you see a message like installation of package XXX had non-zero exit status, the package has NOT been properly installed in your R environment. That is, something is WRONG (see below as well). You need to resolve the issues indicated in the error messages before you can successfully install the package in your R system.

Before you install R packages from source, you need to install a few R tools for your operating system. These tools are necessary for you to build the R packages from the source.

For MacOS Catalina users, please install the following applications on your own. They are necessary for building R packages from source.

For Windows users, please install Rtools from CRAN (Please install the version according to your R version).

After you install the above source-building tools, please install the package tidyverse from source. This step is very important because some of its dependent packages require it.

For other packages, however, I still recommend installing them in the normal way, i.e., NOT from source but from the compiled version on CRAN.
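Installing a package does not make its functions available right away; the package also has to be loaded in each R session. The following is a minimal sketch, assuming you want to check for tidyverse before loading it. The variable names pkg and is_installed are made up here. The two commented install.packages() lines show the type argument, which selects between building from source and using a precompiled version; they are not run in this sketch.

```r
pkg <- "tidyverse"

# requireNamespace() returns TRUE if the package is installed
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  # Load the package so that its functions can be called directly
  library(pkg, character.only = TRUE)
} else {
  message("Package '", pkg, "' is not installed yet; install it first.")
}

# Not run: choosing how a package is installed
# install.packages("tidyverse", type = "source")  # build from source
# install.packages("ggplot2", type = "binary")    # precompiled (Windows/macOS)
```

In everyday scripts you would simply write library(tidyverse) at the top; the check above is useful mainly when you are not sure whether the installation succeeded.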

2.11 Setting

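Always set the default text encoding of your files to UTF-8. In recent versions of RStudio, this option is typically found under Tools > Global Options > Code > Saving.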

2.12 Seeking Help

In the process of (learning) programming, one thing you will never be able to dodge is feeling desperate for help. Here are some useful sources from which you may get additional assistance.

Within Rstudio, in the R console, you can always use ? to check the documentation of a particular function (cf. Figure 2.2). When you run the command, you will see the documentation popping up in the output frame of the Rstudio.

?log
?read.table

Figure 2.2: Help 1
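Besides ?, base R offers a few other ways to look things up from the console. The sketch below uses only functions that ship with R; the variable name read_functions is introduced for illustration.

```r
# help() is the function form of ?
help("log")

# args() shows the parameters of a function without opening its documentation
args(read.table)

# apropos() finds objects whose names match a pattern,
# which helps when you remember only part of a function name
read_functions <- apropos("^read\\.")
read_functions
```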

If you need help from others, the first step is to create a reproducible example. The goal of a reproducible example is to package your problematic code in such a way that other people can run it and feel your pain. Then, hopefully, they can provide a solution and put you out of your misery.

Figure 2.3: Help 2

So before you seek help from others (or before you yell at others for help, cf. Figure 2.3):

  • First, you need to make your code reproducible. This means that you need to capture everything, i.e., include any library() calls and create all necessary objects (e.g., files). The easiest way is to check the objects listed in the Environment tab of the Rstudio and identify objects that are relevant to your problematic code chunk.

  • Second, you need to make it minimal. Strip away everything that is not directly related to your problem. This usually involves creating a much smaller and simpler R object than the one you’re facing in real life or even using built-in data.

That sounds like a lot of work! And it can be, but it has a great payoff:

80% of the time creating an excellent reproducible example reveals the source of your problem. It’s amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question.

The other 20% of time you will have captured the essence of your problem in a way that is easy for others to play with. This substantially improves your chances of getting help!
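To make this concrete, here is a hedged sketch of what a minimal reproducible example might look like. It uses the built-in iris data instead of private files, and the "problem" (taking the mean of a non-numeric column) is invented purely for illustration. The names small_data and problem are assumptions of this sketch.

```r
# Use a small built-in data set instead of your own files
small_data <- head(iris, 3)

# dput() prints R code that recreates the object, so others can copy it
dput(small_data)

# The problematic call, reduced to a single line:
# Species is a factor, so mean() cannot average it and issues a warning
problem <- tryCatch(
  mean(small_data$Species),
  warning = function(w) conditionMessage(w)
)
problem

# Report your R version and loaded packages along with the example
sessionInfo()
```

Posting the dput() output, the single failing line, and the sessionInfo() output gives others everything they need to run the code themselves.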

The following is a list of resources where people usually get external assistance quickly:

2.13 Language Learning Ain’t Easy!

Learning R is like learning another foreign language. It is a long journey. You can’t expect yourself to learn all the vocabulary of the new language in one day. Also, you will forget things you learn all the time. Everyone’s been there. When your script does not work as expected, don’t be frustrated. Take a break and resume later. What I can say is that: it is always NORMAL to debug a script for hours or even days via endless searches on Google.

That being said, here I would like to share with you some of the most common problems we may run into:

  • You created an R script file (*.r) and opened it in the Rstudio, but the script didn’t work simply because you didn’t execute the script in R console (i.e., you didn’t send the script to R console.)

  • If you get an error message saying that an object was not found (e.g., object 'x' not found), check the object name again to see whether you have mistyped it. If not, check your current environment to see whether you have forgotten to execute some assignment commands in the earlier parts of the script (i.e., the object has NOT even been created yet).

  • If you get an error message saying that a function could not be found (e.g., could not find function "f"), check whether you have the correct function name. More often, check whether you have properly loaded the libraries in which the function is defined.

  • To understand the meaning of the error messages is crucial to your development of R proficiency. To achieve this, you have to know very well every object name you have created in your script (as well as in your environment). For example:

    • What type of object is it? (i.e., the class of the object, e.g., vector, list, data.frame?)
    • For primitive vectors, what data type does the vector belong to? (e.g., numeric, character, Boolean, factor?)
    • What is the dimensionality of the object? (nrows, ncols?)
  • Sometimes the script fails simply because of obvious syntax errors. Pay attention to all the punctuation marks in every R command. They are far more important (or lethal) than you think. They include:

    • ,: commas between arguments inside a function
    • ": quotes for strings/characters
    • (): parentheses for functions
    • {}: curly brackets for control structures
  • In my experience, about 80 percent of errors boil down to a simple typo in the end. No kidding. Copy-and-paste helps.

  • DO NOT assume that your R script always works as intended! Always keep two questions in mind:

    • Did R produce the intended result?
    • What does the R object actually contain?
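The checks described above can be sketched in a few lines of base R. The data frame is a shortened copy of the earlier ex_df example, and no_such_object and no_such_function are deliberately undefined names used only to trigger the two common errors. tryCatch() captures each error message as text so that the script keeps running.

```r
ex_df <- data.frame(
  WORD = c("the", "boy", "you", "him"),
  FREQ = c(1104, 35, 104, 34)
)

# What type of object is it?
class(ex_df)

# What data type does a column have?
class(ex_df$FREQ)

# What is the dimensionality (rows, columns)?
dim(ex_df)

# Has the object been created in the current environment?
exists("ex_df")

# Using a name that was never assigned triggers the "not found" error
msg_object <- tryCatch(
  no_such_object + 1,
  error = function(e) conditionMessage(e)
)
msg_object

# Calling a function that is not defined or not loaded triggers a different error
msg_function <- tryCatch(
  no_such_function(1),
  error = function(e) conditionMessage(e)
)
msg_function
```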

2.14 Keyboard Shortcuts

The best way to talk to a computer is via the keyboard. Scripting requires a lot of typing. Keyboard shortcuts may save you a lot of time. Here are some of the handy shortcuts:

  • Ctrl/Command + Enter: run the current line (send from the script to the console)
  • Ctrl/Command + A: select all
  • Ctrl/Command + C: copy
  • Ctrl/Command + X: cut
  • Ctrl/Command + V: paste
  • Ctrl/Command + Z: undo
  • (Mac) Alt/Option + Left/Right: move cursor by a word
  • (Windows) Ctrl + Left/Right: move cursor by a word
  • (Mac) Command + Left/Right: move cursor to the beginning/end of line
  • (Windows) Home/End: move cursor to the beginning/end of line
  • (Mac) Command + Tab: switch in-between different working windows/apps
  • Ctrl/Command + S: save file
  • Ctrl/Command + Shift + C: comment/uncomment selected lines

Exercise 2.4 Make yourself familiar with the iris data set, which is included in R.

Exercise 2.5 Use ? to make yourself familiar with the following commands: str, summary, dim, colnames, names, nrow, ncol, head, and tail.

What information can you get with all these commands?

Exercise 2.6 Write a function to compute the factorial of a non-negative integer, x, expressed as x!. The factorial of x refers to x multiplied by the product of all integers less than x, down to 1.

For example, 3! = 3 x 2 x 1 = 6.

As a special case, zero factorial is defined as 1.

Confirm that your function produces the same results as below:

  • 5! = 120
  • 120! = 6.689503e+198
  • 0! = 1
# A Sample Format for your Function

myfac <- function(x){

}
##(i)
myfac(5)
[1] 120
##(ii)
myfac(120)
[1] 6.689503e+198
##(iii)
myfac(0)
[1] 1