Introduction to R and RStudio

Raphael Morsomme

Welcome

Meet the instructor

Headshot of Raphael Morsomme

Ph.D. candidate in statistics at Duke University
Research focus: stochastic epidemic models and breast cancer overdiagnosis
Fan of Formula 1 and model building
UCM alum of 2018

Objectives of the workshop

to introduce you to R and RStudio
to give you the tools to conduct the data analysis portion of your research project in R

Expectations

What background is assumed for the workshop? Everyone is welcome.
Will we be doing coding? Yes. We will use R.
What if I have never coded? I do not expect participants to have any experience with R or any other programming language.
Will we learn the mathematical theory of statistics? No. We will only learn to visualize and implement simple statistical models in R.

Outline

data analysis
R and RStudio
manipulating numbers, vectors and dataframes
visualization
modeling
visualizing models

Data analysis

What is data analysis?

“Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. (…) In today’s business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.”

Source: Wikipedia

Examples of data analyses in practice

`R` and `RStudio`

Why `R`

free and open-source,
large community of users (stackoverflow)
used in almost every industry
go-to programming language for statisticians, and
great tools for visualization

Getting started

Getting started with RStudio

Manipulating numbers

Using `R` as a calculator

Type the following in the Console (lower left corner in RStudio)

5 + 7

[1] 12

and press (ctrl/command + enter) to run the line of code.

Instead of typing, you can also simply copy and paste the code in the console.

Comments

What follows a “#” is a comment

2 ^ 10 # exponent

[1] 1024

The part 2 ^ 10 is run by R and the part # exponent is ignored by R.

Comments are useful for communicating with collaborators.

Calculating more complex expressions

(3 + 8) / (2 + 8 ^ 1.5)

[1] 0.4466567

((3 + 8) / (2 + 8 ^ 1.5))^(-5.65 / 4)^7

[1] 8445.545

The sky is the limit.

Assigning values

Suppose you want to use the number

((3 + 8) / (2 + 8 ^ 1.5))^(-5.65 / 4)^7

[1] 8445.545

for different purposes, e.g. multiply it, divide it, raise it to different powers, etc.

We can simply store this number by assign it to some object with the command =.

x = ((3 + 8) / (2 + 8 ^ 1.5))^(-5.65 / 4)^7

The object x is the number I wanted to save.

I can print the value of x

[1] 8445.545

I can multiply it, divide it, raise it to different powers

5*x

[1] 42227.72

x/2.5

[1] 3378.218

x^3

[1] 602397282682

Reconstructing `x`

Let us use what we just learned to reconstruct x from the numerator, denominator and exponent.

numer = 3 + 8
denom = 2 + 8 ^ 1.5
expon = (-5.65 / 4)^7

x_reconstructed = (numer / denom) ^ expon

The environment

Check the environment in the upper right corner. It contains the objects that you have created.

you should see the objects denom, expon, numer, x and x_reconstructed.

Note that x and x_reconstructed have the same value!

Simple operations

Here is a list of R commands for common mathematical operations

sin(2) # sine

[1] 0.9092974

cos(2) # cosine

[1] -0.4161468

log(10) # natural logarithm

[1] 2.302585

exp(4) # Euler's constant e (2.718) raised to the fourth power

[1] 54.59815

exp(log(10)) # nested functions

[1] 10

On the previous slide, note that the last line of code is equivalent to the following two lines

tmp = log(10)
exp(tmp)

[1] 10

Manipulating vectors

Vectors

A vector of length $k$ is simply a sequence of $k$ numbers.

In R, we create vectors with the c command

c(2, 5, 5, 8, 0)

[1] 2 5 5 8 0

Assigning a vector

I can of course assign a vector to an object

x = c(2, 5, 5, 8, 0) # replace previous object x
x

[1] 2 5 5 8 0

Overwriting objects

Note that this piece of code replaced the previous object x present in the environment. The previous value of x is now lost.

Operations on vectors

Here is a list of R commands for common operations on vectors

1.5 * x + 2

[1]  5.0  9.5  9.5 14.0  2.0

mean(x)

[1] 4

median(x)

[1] 5

max(x)

[1] 8

min(x)

[1] 0

summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       2       5       4       5       8

var(x)

[1] 9.5

sd(x)

[1] 3.082207

Manipulating dataframes

Dataframes

A dataframe is simply a collection of vectors of same length.

Vectors: columns
Dataframes: rectangles

Libraries

Packages are user-created collections of new functions for R.

Packages need to be

installed on your machine once
loaded every session.

The package ggplot2 contains the dataframe mpg.

# install.packages("ggplot2") # install ggplot2; run only once
library(ggplot2) # load ggplot2; run every session

The `mpg` dataframe

For simplicity, I assign the dataframe mpg to the object d

d = mpg

I print its first few rows for inspection with the command head.

head(d)

# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

Dimensions of `mpg`

The commands nrow and ncol respectively provide the number of rows (observations) and columns (variables) of a dataframe.

nrow(d)

[1] 234

ncol(d)

[1] 11

Accessing variables

I use the command $ to access a variable

d$displ

  [1] 1.8 1.8 2.0 2.0 2.8 2.8 3.1 1.8 1.8 2.0 2.0 2.8 2.8 3.1 3.1 2.8 3.1 4.2
 [19] 5.3 5.3 5.3 5.7 6.0 5.7 5.7 6.2 6.2 7.0 5.3 5.3 5.7 6.5 2.4 2.4 3.1 3.5
 [37] 3.6 2.4 3.0 3.3 3.3 3.3 3.3 3.3 3.8 3.8 3.8 4.0 3.7 3.7 3.9 3.9 4.7 4.7
 [55] 4.7 5.2 5.2 3.9 4.7 4.7 4.7 5.2 5.7 5.9 4.7 4.7 4.7 4.7 4.7 4.7 5.2 5.2
 [73] 5.7 5.9 4.6 5.4 5.4 4.0 4.0 4.0 4.0 4.6 5.0 4.2 4.2 4.6 4.6 4.6 5.4 5.4
 [91] 3.8 3.8 4.0 4.0 4.6 4.6 4.6 4.6 5.4 1.6 1.6 1.6 1.6 1.6 1.8 1.8 1.8 2.0
[109] 2.4 2.4 2.4 2.4 2.5 2.5 3.3 2.0 2.0 2.0 2.0 2.7 2.7 2.7 3.0 3.7 4.0 4.7
[127] 4.7 4.7 5.7 6.1 4.0 4.2 4.4 4.6 5.4 5.4 5.4 4.0 4.0 4.6 5.0 2.4 2.4 2.5
[145] 2.5 3.5 3.5 3.0 3.0 3.5 3.3 3.3 4.0 5.6 3.1 3.8 3.8 3.8 5.3 2.5 2.5 2.5
[163] 2.5 2.5 2.5 2.2 2.2 2.5 2.5 2.5 2.5 2.5 2.5 2.7 2.7 3.4 3.4 4.0 4.7 2.2
[181] 2.2 2.4 2.4 3.0 3.0 3.5 2.2 2.2 2.4 2.4 3.0 3.0 3.3 1.8 1.8 1.8 1.8 1.8
[199] 4.7 5.7 2.7 2.7 2.7 3.4 3.4 4.0 4.0 2.0 2.0 2.0 2.0 2.8 1.9 2.0 2.0 2.0
[217] 2.0 2.5 2.5 2.8 2.8 1.9 1.9 2.0 2.0 2.5 2.5 1.8 1.8 2.0 2.0 2.8 2.8 3.6

This is simply a vector. I can assign it to a object and manipulate it.

x = d$displ
mean(x)

[1] 3.471795

summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.600   2.400   3.300   3.472   4.600   7.000

var(x)

[1] 1.669158

var(d$displ) # equivalent to the previous line

[1] 1.669158

Break

Visualization

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

Types of variables

Roughly speaking, there are two types of variables

numerical (height, age, number of siblings, etc)
categorical (eye color, level of education, place of birth, etc)

Histograms

A histogram summarizes the distribution of a continuous variable. Simply use the command hist.

hist(x = d$cty)

Title, axis label and number of bins

We can improve the figure by adding a title, improving the x-axis label and changing the number of breaks

hist(
  x = d$cty,
  xlab = "City Miles per Gallon",
  main = "Fuel Consumption in City",
  breaks = 15
  )

Help files

The help file of a R function describes

what a function does,
what arguments are available (e.g. xlab, main, breaks)
a few examples (often)

To access the help file of a function, simple use the command help

help(hist)

Boxplots

Boxplots are a more compact way than histograms to summarizes the distribution of a continuous variable. To make a boxplot, simply use the command boxplot.

boxplot(x = d$hwy)

Figure editing

Again, we can add a title and a y-axis label to improve the figure.

boxplot(
  x = d$hwy, 
  ylab = "Miles per Gallon",
  main = "Fuel Consumption on Highway"
  )

Scatterplot

Scatterplot are used to visualize the relationship between two continuous variables

plot(x = d$cty, y = d$hwy)

Figure editing

plot(
  x = d$cty,
  y = d$hwy,
  main = "Fuel Consumption of 38 Car Models",
  xlab = "City Miles per Gallon",
  ylab = "Highway Miles per Gallon"
  )

Tables

To summarize a categorical variable, we can use a table

table(d$drv) # 4-wheel, front-wheel, rear wheel drive


  4   f   r 
103 106  25

Table for numerical variable?

What happens if we make a table for a numerical variable?

table(d$hwy) # meaningful?


12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 41 
 5  2 10  7 31 10 13 11  2  7  7 13 15 32 14  7 22  4  7  4  2  1  2  2  1  1 
44 
 2

This is not helpful! A histogram would be much better.

R is only a tool

R is only a tool. It will let you use it to do stupid things!

Two-way tables

To look at the relation between two categorical variables, we can use a two-way table

table(d$drv, d$fl) # drive and fuel type

   
     c  d  e  p  r
  4  0  2  6 20 75
  f  1  3  1 25 76
  r  0  0  1  7 17

Modeling (requires statistics)

t-test

Is the fuel consumption in the city and on the highway the same?

t.test(d$cty, d$hwy) # two-sample t-test


    Welch Two Sample t-test

data:  d$cty and d$hwy
t = -13.755, df = 421.79, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.521683 -5.640710
sample estimates:
mean of x mean of y 
 16.85897  23.44017

Scientific notation

e-16 is equivalent to the scientific notation $10^{-16}$.

Paired t-test

Actually, we should use a paired t-test in this case.

t.test(d$cty, d$hwy, paired = TRUE) # paired t-test


    Paired t-test

data:  d$cty and d$hwy
t = -44.492, df = 233, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.872628 -6.289765
sample estimates:
mean of the differences 
              -6.581197

Simple linear regression

m_simple = lm(hwy ~ cty, data = d)
summary(m_simple)


Call:
lm(formula = hwy ~ cty, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3408 -1.2790  0.0214  1.0338  4.0461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.89204    0.46895   1.902   0.0584 .  
cty          1.33746    0.02697  49.585   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.752 on 232 degrees of freedom
Multiple R-squared:  0.9138,    Adjusted R-squared:  0.9134 
F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16

Multiple linear regression

m_multiple = lm(hwy ~ cty + displ, data = d)
summary(m_multiple)


Call:
lm(formula = hwy ~ cty + displ, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3124 -1.2423  0.0053  1.0296  4.1243 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.15145    1.21271   0.949    0.343    
cty          1.32914    0.04490  29.602   <2e-16 ***
displ       -0.03432    0.14791  -0.232    0.817    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.756 on 231 degrees of freedom
Multiple R-squared:  0.9138,    Adjusted R-squared:  0.913 
F-statistic:  1224 on 2 and 231 DF,  p-value: < 2.2e-16

Interactions

m_interaction = lm(hwy ~ cty + displ + cty:displ, data = d)
summary(m_interaction)


Call:
lm(formula = hwy ~ cty + displ + cty:displ, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2546 -1.0888  0.1424  0.8978  4.1637 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.32675    1.48875   3.578 0.000422 ***
cty          1.03665    0.07795  13.299  < 2e-16 ***
displ       -1.69320    0.39471  -4.290 2.64e-05 ***
cty:displ    0.12029    0.02670   4.505 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.687 on 230 degrees of freedom
Multiple R-squared:  0.9208,    Adjusted R-squared:  0.9198 
F-statistic: 891.2 on 3 and 230 DF,  p-value: < 2.2e-16

Or simply

m_interaction2 = lm(hwy ~ cty*displ, data = d)
summary(m_interaction2)


Call:
lm(formula = hwy ~ cty * displ, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2546 -1.0888  0.1424  0.8978  4.1637 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.32675    1.48875   3.578 0.000422 ***
cty          1.03665    0.07795  13.299  < 2e-16 ***
displ       -1.69320    0.39471  -4.290 2.64e-05 ***
cty:displ    0.12029    0.02670   4.505 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.687 on 230 degrees of freedom
Multiple R-squared:  0.9208,    Adjusted R-squared:  0.9198 
F-statistic: 891.2 on 3 and 230 DF,  p-value: < 2.2e-16

ANOVA

m_anova = aov(hwy ~ class, data = d)
summary(m_anova)

             Df Sum Sq Mean Sq F value Pr(>F)    
class         6   5683   947.2   83.39 <2e-16 ***
Residuals   227   2578    11.4                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pairwise t-tests

pairwise.t.test(
  x = d$hwy, # response variable
  g = d$class, # grouping variable
  p.adjust.method   = "bonferroni" # Bonferroni correction to control type-I error rate.
  )


    Pairwise comparisons using t tests with pooled SD 

data:  d$hwy and d$class 

           2seater compact midsize minivan pickup  subcompact
compact    0.59562 -       -       -       -       -         
midsize    1.00000 1.00000 -       -       -       -         
minivan    1.00000 7.1e-06 0.00052 -       -       -         
pickup     3.9e-05 < 2e-16 < 2e-16 0.00011 -       -         
subcompact 0.82210 1.00000 1.00000 2.9e-05 < 2e-16 -         
suv        0.00064 < 2e-16 < 2e-16 0.00335 1.00000 < 2e-16   

P value adjustment method: bonferroni

Visualizing models

Visualizing simple regression

plot(
  x = d$cty,
  y = d$hwy,
  main = "Fuel Economy of 38 Models of Cars",
  xlab = "City Miles per Gallon",
  ylab = "Highway Miles per Gallon"
  )
abline(m_simple, col = "red") # fitted regression curve

Visualizing ANOVA

boxplot( # complements m5 (ANOVA)
  d$hwy ~ d$class, 
  ylab = "Highway Miles per Gallon",
  xlab = "Type of Car"
  )

Going further

Two excellent books

Introduction to Modern Statistics, available on Openintro

R for Data Science, available on R4DS

Both books are freely available online.

Introduction to R and RStudio

Welcome

Meet the instructor

Objectives of the workshop

Expectations

Outline

Data analysis

What is data analysis?

Examples of data analyses in practice

R and RStudio

Why R

Getting started

Manipulating numbers

Using R as a calculator

Comments

Calculating more complex expressions

Assigning values

Reconstructing x

The environment

Simple operations

Manipulating vectors

Vectors

Assigning a vector

Operations on vectors

Manipulating dataframes

Dataframes

Libraries

The mpg dataframe

Dimensions of mpg

Accessing variables

Break

Visualization

Visualization

Types of variables

Histograms

Title, axis label and number of bins

Help files

Boxplots

Figure editing

Scatterplot

Figure editing

Tables

Table for numerical variable?

Two-way tables

Modeling (requires statistics)

t-test

Paired t-test

Simple linear regression

Multiple linear regression

Interactions

ANOVA

Pairwise t-tests

Visualizing models

Visualizing simple regression

Visualizing ANOVA

Going further

Two excellent books

`R` and `RStudio`

Why `R`

Using `R` as a calculator

Reconstructing `x`

The `mpg` dataframe

Dimensions of `mpg`