Raphael Morsomme
“Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. (…) In today’s business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.”
Source: Wikipedia
R and RStudio
R as a calculator
Type the following in the Console (lower left corner in RStudio) and press Enter to run the line of code (when the code sits in a script instead, Ctrl/Command + Enter runs the current line).
Instead of typing, you can also simply copy and paste the code in the console.
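The original lines of code for this slide were not preserved. As a minimal sketch, here are a few calculations of the kind you could type in the Console (the numbers are arbitrary).

```r
# Basic arithmetic typed directly in the Console
1 + 1        # addition
10 - 4       # subtraction
3 * 7        # multiplication
20 / 8       # division
2 ^ 10       # exponentiation
(1 + 2) * 3  # parentheses control the order of operations
```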
The sky is the limit.
Suppose you want to use the same number for different purposes, e.g. multiply it, divide it, raise it to different powers, etc.
We can store this number by assigning it to an object with the command =.
The object x now holds the number I wanted to save.
Let us use what we just learned to reconstruct x from the numerator, denominator and exponent.
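The number used on the original slide was not preserved, so the sketch below assumes a hypothetical value, (3 / 4) ^ 2. The object names numer, denom, expon, x and x_reconstructed are the ones mentioned in the slides; the values are my assumption.

```r
# Store a number in an object (the value itself is a made-up example)
x = (3 / 4) ^ 2
x

# Store each building block in its own object
numer = 3   # numerator (assumed value)
denom = 4   # denominator (assumed value)
expon = 2   # exponent (assumed value)

# Rebuild the same number from the stored pieces
x_reconstructed = (numer / denom) ^ expon
x_reconstructed
```

Many R users write the assignment with <- instead of =; both work here.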
Check the environment in the upper right corner. It contains the objects that you have created.
You should see the objects denom, expon, numer, x and x_reconstructed.
Note that x and x_reconstructed have the same value!
Here is a list of R commands for common mathematical operations
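The original list was not preserved. As a rough substitute, here are some base R functions that are commonly used for mathematical operations (the selection and the numbers are mine).

```r
sqrt(16)           # square root
exp(1)             # exponential function
log(100)           # natural logarithm
log10(100)         # base-10 logarithm
abs(-3)            # absolute value
round(3.14159, 2)  # round to 2 decimal places
sqrt(abs(-16))     # functions can be nested
```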
On the previous slide, note that the last line of code is equivalent to the following two lines
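The code this remark refers to was not preserved. As a hypothetical illustration, a nested call such as sqrt(abs(-16)) gives the same result as computing the inner value first, storing it, and then applying the outer function. The object name y is made up for this sketch.

```r
# One line: the inner function is evaluated first, then the outer one
nested = sqrt(abs(-16))

# Two lines: store the intermediate result, then use it
y = abs(-16)
two_steps = sqrt(y)

nested == two_steps  # both approaches give the same value
```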
A vector of length \(k\) is simply a sequence of \(k\) numbers.
In R, we create vectors with the c command
I can of course assign a vector to an object
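A minimal sketch of both steps, creating a vector with c and assigning it to an object. The values and the object name v are made up for illustration.

```r
c(1, 5, 2)           # a vector of length 3, printed but not saved

v = c(4, 8, 15, 16)  # a vector of length 4, saved in the object v
v                    # print the stored vector
```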
Here is a list of R commands for common operations on vectors
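The original list was not preserved. Here are some base R functions that are often applied to vectors; the selection and the example vector v are my own.

```r
v = c(4, 8, 15, 16, 23, 42)  # hypothetical example vector

length(v)  # number of elements
sum(v)     # sum of the elements
mean(v)    # average
min(v)     # smallest element
max(v)     # largest element
sort(v, decreasing = TRUE)  # elements from largest to smallest
v * 2      # arithmetic is applied to every element
v[2]       # second element
```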
A dataframe is simply a collection of vectors of the same length.
Vectors: columns
Dataframes: rectangles
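To make the "columns" and "rectangle" picture concrete, here is a small sketch that builds a dataframe from two vectors of the same length. The column names and values are invented for illustration.

```r
# Two vectors of the same length (made-up data)
height_cm = c(170, 182, 165)
weight_kg = c(65, 80, 58)

# Each vector becomes a column of the dataframe
df = data.frame(height_cm, weight_kg)
df
```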
Packages are user-created collections of new functions for R.
Packages need to be installed once and then loaded in every new R session.
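In practice, a package is installed once with install.packages and then loaded with library in each new session. The sketch below assumes the package is ggplot2, since that is where the mpg dataframe lives; the original slides may have loaded it differently (e.g. through tidyverse).

```r
# Install once per computer (commented out so that it is not re-run every time)
# install.packages("ggplot2")

# Load in every new R session before using the package
library(ggplot2)
```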
The mpg dataframe
For simplicity, I assign the dataframe mpg to the object d.
I print its first few rows for inspection with the command head.
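The two steps presumably looked something like the sketch below, assuming mpg comes from the ggplot2 package.

```r
library(ggplot2)  # provides the mpg dataframe

d = mpg   # assign the dataframe to the object d
head(d)   # print the first 6 rows
```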
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
The commands nrow and ncol respectively provide the number of rows (observations) and columns (variables) of a dataframe.
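A short sketch, assuming d holds the mpg dataframe from ggplot2 as above.

```r
library(ggplot2)
d = mpg

nrow(d)  # number of rows (observations)
ncol(d)  # number of columns (variables)
```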
I use the command $ to access a variable
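For instance, the engine displacement column displ can be pulled out of d as follows. The choice of displ is inferred from the printed values; d is assumed to be the mpg dataframe from ggplot2.

```r
library(ggplot2)
d = mpg

d$displ  # the variable displ, returned as a vector
```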
[1] 1.8 1.8 2.0 2.0 2.8 2.8 3.1 1.8 1.8 2.0 2.0 2.8 2.8 3.1 3.1 2.8 3.1 4.2
[19] 5.3 5.3 5.3 5.7 6.0 5.7 5.7 6.2 6.2 7.0 5.3 5.3 5.7 6.5 2.4 2.4 3.1 3.5
[37] 3.6 2.4 3.0 3.3 3.3 3.3 3.3 3.3 3.8 3.8 3.8 4.0 3.7 3.7 3.9 3.9 4.7 4.7
[55] 4.7 5.2 5.2 3.9 4.7 4.7 4.7 5.2 5.7 5.9 4.7 4.7 4.7 4.7 4.7 4.7 5.2 5.2
[73] 5.7 5.9 4.6 5.4 5.4 4.0 4.0 4.0 4.0 4.6 5.0 4.2 4.2 4.6 4.6 4.6 5.4 5.4
[91] 3.8 3.8 4.0 4.0 4.6 4.6 4.6 4.6 5.4 1.6 1.6 1.6 1.6 1.6 1.8 1.8 1.8 2.0
[109] 2.4 2.4 2.4 2.4 2.5 2.5 3.3 2.0 2.0 2.0 2.0 2.7 2.7 2.7 3.0 3.7 4.0 4.7
[127] 4.7 4.7 5.7 6.1 4.0 4.2 4.4 4.6 5.4 5.4 5.4 4.0 4.0 4.6 5.0 2.4 2.4 2.5
[145] 2.5 3.5 3.5 3.0 3.0 3.5 3.3 3.3 4.0 5.6 3.1 3.8 3.8 3.8 5.3 2.5 2.5 2.5
[163] 2.5 2.5 2.5 2.2 2.2 2.5 2.5 2.5 2.5 2.5 2.5 2.7 2.7 3.4 3.4 4.0 4.7 2.2
[181] 2.2 2.4 2.4 3.0 3.0 3.5 2.2 2.2 2.4 2.4 3.0 3.0 3.3 1.8 1.8 1.8 1.8 1.8
[199] 4.7 5.7 2.7 2.7 2.7 3.4 3.4 4.0 4.0 2.0 2.0 2.0 2.0 2.8 1.9 2.0 2.0 2.0
[217] 2.0 2.5 2.5 2.8 2.8 1.9 1.9 2.0 2.0 2.5 2.5 1.8 1.8 2.0 2.0 2.8 2.8 3.6
This is simply a vector. I can assign it to an object and manipulate it.
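A minimal sketch of what that could look like. The object name displ and the chosen summaries are my own; d is assumed to be the mpg dataframe from ggplot2.

```r
library(ggplot2)
d = mpg

displ = d$displ  # store the column in its own object

mean(displ)      # average engine displacement
min(displ)       # smallest value
max(displ)       # largest value
```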
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey
Roughly speaking, there are two types of variables: continuous (numerical) and categorical.
A histogram summarizes the distribution of a continuous variable. Simply use the command hist.
We can improve the figure by adding a title, improving the x-axis label and changing the number of breaks
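The plotting code was not preserved, so here is a sketch of a basic and an improved histogram. I assume the variable is hwy (highway fuel consumption) from the mpg dataframe; the title, label text and number of breaks are my own choices.

```r
library(ggplot2)
d = mpg

# Basic histogram with default settings
hist(d$hwy)

# Improved histogram: title, x-axis label and a finer set of bins
hist(
  d$hwy,
  main   = "Highway fuel consumption",  # title
  xlab   = "Miles per gallon",          # x-axis label
  breaks = 20                           # suggested number of bins
)

# Run ?hist (or help(hist)) to open the help file of the function
```

Note that breaks = 20 is only a suggestion to R, which may pick a nearby number of bins that gives round cut points.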
The help file of an R function describes, among other things, its arguments (e.g. xlab, main, breaks).
Boxplots are a more compact way than histograms to summarize the distribution of a continuous variable. To make a boxplot, simply use the command boxplot.
Again, we can add a title and a y-axis label to improve the figure.
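A sketch of both versions, again assuming the variable hwy from the mpg dataframe; the title and label text are my own.

```r
library(ggplot2)
d = mpg

# Basic boxplot
boxplot(d$hwy)

# Boxplot with a title and a y-axis label
boxplot(
  d$hwy,
  main = "Highway fuel consumption",  # title
  ylab = "Miles per gallon"           # y-axis label
)
```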
Scatterplots are used to visualize the relationship between two continuous variables.
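As a sketch, the base R command plot draws a scatterplot when given two numerical vectors. I assume the pair cty and hwy from the mpg dataframe; the labels are my own.

```r
library(ggplot2)
d = mpg

plot(
  x    = d$cty,  # city fuel consumption on the x-axis
  y    = d$hwy,  # highway fuel consumption on the y-axis
  xlab = "City (miles per gallon)",
  ylab = "Highway (miles per gallon)",
  main = "Highway versus city fuel consumption"
)
```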
To summarize a categorical variable, we can use a table
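For example, assuming the categorical variable class from the mpg dataframe, the command table counts how many cars fall in each category.

```r
library(ggplot2)
d = mpg

tab_class = table(d$class)  # number of cars per class
tab_class
```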
What happens if we make a table for a numerical variable?
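The counts printed next match the variable hwy, so the command was presumably along these lines (d is assumed to be the mpg dataframe from ggplot2).

```r
library(ggplot2)
d = mpg

tab_hwy = table(d$hwy)  # one count per distinct value of hwy
tab_hwy
```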
12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 41 44
 5  2 10  7 31 10 13 11  2  7  7 13 15 32 14  7 22  4  7  4  2  1  2  2  1  1  2
This is not helpful! A histogram would be much better.
To look at the relation between two categorical variables, we can use a two-way table
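Passing two variables to table gives a two-way table. The pair class and drv is my assumption; the original slide may have used other variables.

```r
library(ggplot2)
d = mpg

# Rows: class of car, columns: type of drive train
tab_two_way = table(d$class, d$drv)
tab_two_way
```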
Is the fuel consumption in the city and on the highway the same?
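Judging from the output that follows (data: d$cty and d$hwy, Welch test), the question was addressed with a call like the one below; d is assumed to be the mpg dataframe from ggplot2.

```r
library(ggplot2)
d = mpg

# Two-sample t-test comparing mean city and highway fuel consumption
# (by default, t.test does not assume equal variances: Welch test)
res = t.test(d$cty, d$hwy)
res
```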
Welch Two Sample t-test
data: d$cty and d$hwy
t = -13.755, df = 421.79, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.521683 -5.640710
sample estimates:
mean of x mean of y
16.85897 23.44017
Actually, we should use a paired t-test in this case, because each car contributes both a city and a highway measurement.
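A sketch of the paired version, using the paired argument of t.test. The test is then based on the within-car differences between cty and hwy; d is assumed to be the mpg dataframe from ggplot2.

```r
library(ggplot2)
d = mpg

# Paired t-test: each car has one city and one highway value
res_paired = t.test(d$cty, d$hwy, paired = TRUE)
res_paired
```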
Call:
lm(formula = hwy ~ cty, data = d)
Residuals:
Min 1Q Median 3Q Max
-5.3408 -1.2790 0.0214 1.0338 4.0461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.89204 0.46895 1.902 0.0584 .
cty 1.33746 0.02697 49.585 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.752 on 232 degrees of freedom
Multiple R-squared: 0.9138, Adjusted R-squared: 0.9134
F-statistic: 2459 on 1 and 232 DF, p-value: < 2.2e-16
Call:
lm(formula = hwy ~ cty + displ, data = d)
Residuals:
Min 1Q Median 3Q Max
-5.3124 -1.2423 0.0053 1.0296 4.1243
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.15145 1.21271 0.949 0.343
cty 1.32914 0.04490 29.602 <2e-16 ***
displ -0.03432 0.14791 -0.232 0.817
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.756 on 231 degrees of freedom
Multiple R-squared: 0.9138, Adjusted R-squared: 0.913
F-statistic: 1224 on 2 and 231 DF, p-value: < 2.2e-16
Call:
lm(formula = hwy ~ cty + displ + cty:displ, data = d)
Residuals:
Min 1Q Median 3Q Max
-4.2546 -1.0888 0.1424 0.8978 4.1637
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.32675 1.48875 3.578 0.000422 ***
cty 1.03665 0.07795 13.299 < 2e-16 ***
displ -1.69320 0.39471 -4.290 2.64e-05 ***
cty:displ 0.12029 0.02670 4.505 1.06e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.687 on 230 degrees of freedom
Multiple R-squared: 0.9208, Adjusted R-squared: 0.9198
F-statistic: 891.2 on 3 and 230 DF, p-value: < 2.2e-16
Or simply
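The Call lines in the two outputs show the formulas involved. In an R formula, cty * displ is shorthand for cty + displ + cty:displ, so both fits are the same model. The object names fit_long and fit_short are my own; d is assumed to be the mpg dataframe from ggplot2.

```r
library(ggplot2)
d = mpg

# Main effects and interaction written out in full
fit_long = lm(hwy ~ cty + displ + cty:displ, data = d)

# Shorthand: * expands to both main effects plus their interaction
fit_short = lm(hwy ~ cty * displ, data = d)

summary(fit_short)
```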
Call:
lm(formula = hwy ~ cty * displ, data = d)
Residuals:
Min 1Q Median 3Q Max
-4.2546 -1.0888 0.1424 0.8978 4.1637
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.32675 1.48875 3.578 0.000422 ***
cty 1.03665 0.07795 13.299 < 2e-16 ***
displ -1.69320 0.39471 -4.290 2.64e-05 ***
cty:displ 0.12029 0.02670 4.505 1.06e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.687 on 230 degrees of freedom
Multiple R-squared: 0.9208, Adjusted R-squared: 0.9198
F-statistic: 891.2 on 3 and 230 DF, p-value: < 2.2e-16
pairwise.t.test(
x = d$hwy, # response variable
g = d$class, # grouping variable
p.adjust.method = "bonferroni" # Bonferroni correction to control type-I error rate.
)
Pairwise comparisons using t tests with pooled SD
data: d$hwy and d$class
           2seater compact midsize minivan pickup  subcompact
compact    0.59562 -       -       -       -       -
midsize    1.00000 1.00000 -       -       -       -
minivan    1.00000 7.1e-06 0.00052 -       -       -
pickup     3.9e-05 < 2e-16 < 2e-16 0.00011 -       -
subcompact 0.82210 1.00000 1.00000 2.9e-05 < 2e-16 -
suv        0.00064 < 2e-16 < 2e-16 0.00335 1.00000 < 2e-16
P value adjustment method: bonferroni
Introduction to Modern Statistics, available on OpenIntro
R for Data Science, available on R4DS
Both books are freely available online.
Comments
What follows a “#” is a comment.
The part 2 ^ 10 is run by R and the part # exponent is ignored by R.
Comments are useful for communicating with collaborators.
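The line of code described above, reassembled from the two parts quoted in the text:

```r
2 ^ 10 # exponent
```

R evaluates 2 ^ 10 and prints 1024; everything after the # is skipped.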