STA 101L - Summer I 2022
Raphael Morsomme
Source: IMS
“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey
There are 5479 observations (rows) and 6 variables (columns).
# A tibble: 6 x 6
year month date_of_month date day_of_week births
<int> <int> <int> <date> <ord> <int>
1 2000 1 1 2000-01-01 Sat 9083
2 2000 1 2 2000-01-02 Sun 8006
3 2000 1 3 2000-01-03 Mon 11363
4 2000 1 4 2000-01-04 Tues 13032
5 2000 1 5 2000-01-05 Wed 12558
6 2000 1 6 2000-01-06 Thurs 12466
Rows: 5,479
Columns: 6
$ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20~
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
$ date_of_month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
$ date <date> 2000-01-01, 2000-01-02, 2000-01-03, 2000-01-04, 2000-01~
$ day_of_week <ord> Sat, Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tue~
$ births <int> 9083, 8006, 11363, 13032, 12558, 12466, 12516, 8934, 794~
We can change the number of bins to have a rougher or more detailed histogram.
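As a rough illustration of how the number of bins changes a histogram, here is a minimal sketch with ggplot2. It assumes ggplot2 is installed and uses the `hwy` variable of the `mpg` dataset as a stand-in for the births data; the object names `p_coarse` and `p_detailed` are made up for this example.

```r
library(ggplot2)

# Assumption: mpg (bundled with ggplot2) stands in for the births data.
# Few bins give a rougher picture of the distribution...
p_coarse <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 10)

# ...while many bins give a more detailed (and noisier) one.
p_detailed <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 40)

p_coarse
p_detailed
```

Neither choice is right in general; trying a few values of `bins` is usually the quickest way to see which features of the distribution are stable.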
To describe the distribution of a numerical variable, we comment on its modality, shape, symmetry and any extreme values.
The distribution of the daily number of births in the US is bimodal with each mode being bell-shaped and symmetric. We observe no extreme value.
Histograms: visualize the distribution of a single numerical variable.
Scatterplots: visualize the relation between two numerical variables.
mpg dataset
# A tibble: 6 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
We will look at the relation between engine size (displ) and fuel efficiency (hwy).
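A scatterplot of these two variables could be drawn as follows. This is only a sketch: it assumes ggplot2 is installed and uses the `mpg` dataset that ships with it; the axis labels and the name `p_scatter` are choices made for this example.

```r
library(ggplot2)

# Each point is one car model: engine size on the x-axis,
# highway fuel efficiency on the y-axis.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon")

p_scatter
```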
\[ \overline{\text{age}} = \dfrac{\text{age}_{Hayden} + \text{age}_{Janice} + \text{age}_{Kenndy} + \text{age}_{Maggie} + \text{age}_{Melissa} + \text{age}_{Yuanzhi}}{6} \]
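The same computation can be written for any numerical variable. In the sketch below, the symbols \(x_1, \dots, x_n\) (the \(n\) observed values) are notation introduced for this illustration:

\[ \bar{x} = \dfrac{x_1 + x_2 + \dots + x_n}{n} = \dfrac{1}{n} \sum_{i=1}^{n} x_i \]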
Percentiles are a generalization of the median.
The value that is larger than p% of the data and smaller than the rest is called the p-th percentile.
The median is the 50th percentile.
We will soon make use of the 25th and 75th percentiles.
Later in the course, the 95th and 97.5th percentiles will also be useful.
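In R, percentiles can be computed with `quantile()`. The sketch below uses only the first six daily birth counts printed earlier, so the numbers are illustrative and not those of the full dataset. Note that `quantile()` interpolates between observations by default, and other conventions exist.

```r
# First six daily birth counts from the table shown earlier
births <- c(9083, 8006, 11363, 13032, 12558, 12466)

# 25th, 50th, 75th, 95th and 97.5th percentiles
pcts <- quantile(births, probs = c(0.25, 0.50, 0.75, 0.95, 0.975))
print(pcts)

# The 50th percentile coincides with the median
median(births)
```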
Real-world data often contain extreme values.
The average (mean), median, variance, standard deviation (sd) and interquartile range (IQR) are not equally robust to the presence of extreme values.
Let us contaminate the birth data with an extreme value of 1 billion…
…and compare the mean, median, variance, sd and IQR of these two variables.
Min. 1st Qu. Median Mean 3rd Qu. Max.
5728 8740 12343 11350 13082 16081
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.728e+03 8.740e+03 1.234e+04 1.938e+05 1.308e+04 1.000e+09
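The comparison can be reproduced on a small scale. The sketch below uses only the first six daily birth counts printed earlier (an assumption made to keep the example self-contained), so its numbers differ from the summaries above; `robustness_summary` is a helper name invented for this example.

```r
# First six daily birth counts from the table shown earlier
births <- c(9083, 8006, 11363, 13032, 12558, 12466)

# Same data, contaminated with one extreme value of 1 billion
births_contaminated <- c(births, 1e9)

# Hypothetical helper: the five summaries discussed on the slide
robustness_summary <- function(x) {
  c(mean = mean(x), median = median(x),
    variance = var(x), sd = sd(x), iqr = IQR(x))
}

comparison <- rbind(
  original     = robustness_summary(births),
  contaminated = robustness_summary(births_contaminated)
)
print(comparison)
```

The mean, variance and sd change by orders of magnitude, while the median and IQR barely move: they depend on the ranks of the observations rather than on how large the extreme value is.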
# A tibble: 6 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
We can add a second categorical variable using colors.
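One way to do this with ggplot2 is to map the second variable to the `fill` aesthetic. The sketch below assumes ggplot2 is installed and uses the `mpg` dataset; the choice of `class` and `drv` and the object names are made for illustration.

```r
library(ggplot2)

# Bars count the cars in each class; colors split each bar by drive type.
p_bar <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

# position = "fill" rescales every bar to height 1,
# which shows proportions within each class instead of counts.
p_bar_prop <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill")

p_bar
p_bar_prop
```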
✅ Combines the strengths of the various barplots.
🛑 Not in the toolbox of every data scientist.
Source: R for Data Science
The thick line in the middle of the box indicates the median;
the box stretches from the 25th percentile (Q1) to the 75th percentile (Q3); it covers 50% of the data;
each whisker extends at most 1.5 IQR from the box;
any observation more than 1.5 IQR away from the box is labelled as an outlier.
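The rules above can be checked by hand in a few lines of base R. This is a sketch under stated assumptions: the data are the first six daily birth counts shown earlier plus one artificial extreme value of 1 billion, and the quartiles come from `quantile()` with its default interpolation, which may differ slightly from the convention a given plotting function uses.

```r
# Six daily birth counts from the earlier table, plus one artificial extreme value
x <- c(9083, 8006, 11363, 13032, 12558, 12466, 1e9)

q1  <- quantile(x, 0.25, names = FALSE)  # lower edge of the box
q3  <- quantile(x, 0.75, names = FALSE)  # upper edge of the box
iqr <- q3 - q1                           # length of the box

# Observations beyond these fences are labelled as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
is_outlier  <- x < lower_fence | x > upper_fence

outliers <- x[is_outlier]
# Whiskers reach the most extreme observations that are not outliers
whiskers <- range(x[!is_outlier])

print(outliers)
print(whiskers)
```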
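library(magrittr)    # provides the pipe %>%
library(kableExtra)  # provides kbl() and kable_classic()
# d_car is assumed to hold the mpg data shown earlier
table(d_car$class, d_car$drv) %>%
  prop.table(1) %>%  # row proportions: each class sums to 1
  round(2) %>%
  kbl(caption = "Distribution of drive type per class of car") %>%
  kable_classic(lightable_options = c("striped", "hover"), full_width = FALSE)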
class | 4 | f | r |
---|---|---|---|
2seater | 0.00 | 0.00 | 1.00 |
compact | 0.26 | 0.74 | 0.00 |
midsize | 0.07 | 0.93 | 0.00 |
minivan | 0.00 | 1.00 | 0.00 |
pickup | 1.00 | 0.00 | 0.00 |
subcompact | 0.11 | 0.63 | 0.26 |
suv | 0.82 | 0.00 | 0.18 |
📋 See this vignette for more details on editing tables
📋 See R for Data Science - chapters 3 and 7 for more on data visualization in R.
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey