STA 101L - Summer I 2022
Raphael Morsomme
Breakdown of variables into their respective types.
Source: IMS
“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey
Group exercise - data summaries?
There are 5479 observations (rows) and 6 variables (columns).
# A tibble: 6 x 6
year month date_of_month date day_of_week births
<int> <int> <int> <date> <ord> <int>
1 2000 1 1 2000-01-01 Sat 9083
2 2000 1 2 2000-01-02 Sun 8006
3 2000 1 3 2000-01-03 Mon 11363
4 2000 1 4 2000-01-04 Tues 13032
5 2000 1 5 2000-01-05 Wed 12558
6 2000 1 6 2000-01-06 Thurs 12466
Rows: 5,479
Columns: 6
$ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20~
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
$ date_of_month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
$ date <date> 2000-01-01, 2000-01-02, 2000-01-03, 2000-01-04, 2000-01~
$ day_of_week <ord> Sat, Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tue~
$ births <int> 9083, 8006, 11363, 13032, 12558, 12466, 12516, 8934, 794~
We can change the number of bins to have a rougher or more detailed histogram.
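A minimal sketch of how the number of bins changes a histogram, assuming the ggplot2 package is installed. The data are simulated as a stand-in (they are not the births data), and the bin counts 10 and 50 are arbitrary choices for illustration.

```r
library(ggplot2)

# Simulated stand-in data (NOT the births data): a bimodal numerical variable.
set.seed(1)
d_sim <- data.frame(
  x = c(rnorm(300, mean = 8000, sd = 500),
        rnorm(700, mean = 12500, sd = 700))
)

# Few bins: a rougher picture of the distribution.
p_rough <- ggplot(d_sim, aes(x = x)) +
  geom_histogram(bins = 10)

# Many bins: a more detailed picture of the distribution.
p_detailed <- ggplot(d_sim, aes(x = x)) +
  geom_histogram(bins = 50)

p_rough
p_detailed
```

With too few bins the two modes can blur together; with too many, the bars become noisy. Trying a few values is usually worthwhile.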
Tip
To explore a numerical variable, always start with a histogram
To describe the distribution of a numerical variable, we comment on
Describing a distribution is an art
Note that some distributions will not fit nicely in these categories.
The distribution of the daily number of births in the US is bimodal with each mode being bell-shaped and symmetric. We observe no extreme value.
Group exercise - describing a distribution
Describe the distributions in exercises 5.6, 5.13, 5.24 and 5.26 (only consider the histograms)
Histograms: visualize the distribution of a single numerical variable.
Scatterplots: visualize the relation between two numerical variables.
The mpg dataset
# A tibble: 6 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
We will look at the relation between engine size (displ) and fuel efficiency (hwy).
Note
To add an additional variable to your visualization, you can use color or symbols.
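A possible sketch of these plots with ggplot2, using the mpg data that ships with that package. Choosing drv (drive type) as the additional variable is an assumption made for illustration.

```r
library(ggplot2)

# Scatterplot of engine size (displ) against highway fuel efficiency (hwy).
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Same plot, with a third variable (drv) shown through color.
p_color <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# Same plot, with the third variable shown through the plotting symbol.
p_shape <- ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point()

p_scatter
p_color
p_shape
```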
\[ \overline{\text{age}} = \dfrac{\text{age}_{Hayden} + \text{age}_{Janice} + \text{age}_{Kenndy} + \text{age}_{Maggie} + \text{age}_{Melissa} + \text{age}_{Yuanzhi}}{6} \]
Percentiles are a generalization of the median.
The value that is larger than p% of the data and smaller than the rest is called the p-th percentile.
The median is the 50th percentile.
We will soon make use of the 25th and 75th percentiles.
Later in the course, the 95th and 97.5th percentiles will also be useful.
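As a small illustration, percentiles can be computed in base R with quantile(). The values below are made up for the example. Note that quantile() interpolates between observations by default, so its output can differ slightly from a hand computation based on the definition above.

```r
# Made-up illustrative values (not course data).
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 21)

# The 25th, 50th and 75th percentiles.
quantile(x, probs = c(0.25, 0.50, 0.75))

# The median is the 50th percentile.
all.equal(median(x), quantile(x, probs = 0.50, names = FALSE))

# Percentiles that become useful later in the course.
quantile(x, probs = c(0.95, 0.975))
```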
Real-world data often contain extreme values
The average, median, variance, standard deviation (sd) and interquartile range (iqr) are not equally robust to the presence of extreme values.
Let us contaminate the birth data with an extreme value of 1 billion…
…and compare the mean, median, variance, sd and iqr of these two variables.
Min. 1st Qu. Median Mean 3rd Qu. Max.
5728 8740 12343 11350 13082 16081
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.728e+03 8.740e+03 1.234e+04 1.938e+05 1.308e+04 1.000e+09
Robustness of the median and the iqr
While the median and iqr are robust to the presence of extreme values, the mean, variance and sd are not.
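The same experiment can be sketched on a small made-up vector. This sketch assumes the extreme value is appended as one extra observation; the slides do not show how the birth data were contaminated.

```r
# Made-up illustrative values (not the births data).
x <- c(10, 12, 13, 15, 16, 18, 20)

# Contaminate the data with one extreme value of 1 billion.
x_contaminated <- c(x, 1e9)

# Helper (hypothetical name) collecting the five summaries discussed above.
summaries <- function(v) {
  c(mean = mean(v), median = median(v),
    variance = var(v), sd = sd(v), iqr = IQR(v))
}

# Compare the summaries side by side.
rbind(original = summaries(x),
      contaminated = summaries(x_contaminated))
```

The median and iqr columns barely move between the two rows, while the mean, variance and sd change by several orders of magnitude.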
# A tibble: 6 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
Group exercise - contingency and proportion table
We can add a second categorical variable using colors.
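A sketch of such barplots with ggplot2 and the mpg data. Using class and drv as the two categorical variables is an assumption made for illustration.

```r
library(ggplot2)

# Barplot of a single categorical variable: number of cars per class.
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Stacked barplot: the fill color shows the drive type within each class.
p_stacked <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

# Dodged barplot: one bar per combination of class and drive type.
p_dodged <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge")

# Filled barplot: every bar has height 1 and shows proportions within a class.
p_filled <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill")

p_stacked
```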
Group exercise - pros and cons of barplots
Exercise 4.5
Beyond 2 variables
Faceted figures are a great way to include \(\ge\) 3 variables! See exercise 1.13
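One way to build a faceted figure, sketched with ggplot2 and the mpg data. The choice of variables (displ, hwy, drv and class) is an assumption for illustration.

```r
library(ggplot2)

# Four variables in one figure:
# x and y positions (displ, hwy), color (drv) and one panel per class.
p_facet <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  facet_wrap(~ class)

p_facet
```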
✅ Combines the strengths of the various barplots.
🛑 Not in the tool box of every data scientist
From raw data to boxplot
Source: R 4 Data Science
The thick line in the middle of the box indicates the median;
the box stretches from the 25th percentile (Q1) to the 75th percentile (Q3); it covers 50% of the data;
each whisker is at most 1.5 iqr long;
any observation more than 1.5 iqr away from the box is labelled as an outlier.
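The pieces of a boxplot can be computed by hand, as in the following base R sketch on made-up values. It uses the default quartiles of quantile(); plotting functions such as boxplot() may use a slightly different quartile definition, so their boxes can differ marginally.

```r
# Made-up illustrative values (not course data); 40 is a deliberate extreme value.
x <- c(3, 5, 6, 7, 8, 9, 10, 11, 12, 40)

# The box: 25th percentile (Q1), median and 75th percentile (Q3).
q1  <- quantile(x, 0.25, names = FALSE)
med <- median(x)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Observations beyond these limits are labelled as outliers.
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Each whisker stops at the most extreme observation within the limits.
whisker_low  <- min(x[x >= lower_limit])
whisker_high <- max(x[x <= upper_limit])

# Outliers: observations more than 1.5 iqr away from the box.
outliers <- x[x < lower_limit | x > upper_limit]
outliers
```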
Outliers
Outliers have an extreme value. How to deal with an outlier depends on why the observation stands out. Outliers can be
Group exercise - limitation of boxplots
Exercise 5.13
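library(kableExtra)  # provides kbl() and kable_classic(), and re-exports %>%

# Row proportions: distribution of drive type within each class of car.
# d_car is defined elsewhere in the slides (presumably the mpg data).
table(d_car$class, d_car$drv) %>%
  prop.table(margin = 1) %>%
  round(2) %>%
  kbl(caption = "Distribution of drive type per class of car") %>%
  kable_classic(lightable_options = c("striped", "hover"), full_width = FALSE)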
class | 4 | f | r |
---|---|---|---|
2seater | 0.00 | 0.00 | 1.00 |
compact | 0.26 | 0.74 | 0.00 |
midsize | 0.07 | 0.93 | 0.00 |
minivan | 0.00 | 1.00 | 0.00 |
pickup | 1.00 | 0.00 | 0.00 |
subcompact | 0.11 | 0.63 | 0.26 |
suv | 0.82 | 0.00 | 0.18 |
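The difference between a contingency table and a proportion table can be sketched as follows. This assumes the table above was built from the mpg data shipped with ggplot2, which is not stated explicitly in the slides.

```r
library(ggplot2)  # only needed for the mpg data

# Contingency table: number of cars for each combination of class and drive type.
counts <- table(mpg$class, mpg$drv)
counts

# Row proportions (margin = 1): each row sums to 1,
# giving the distribution of drive type within each class.
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)

# Column proportions (margin = 2): each column sums to 1,
# giving the distribution of class within each drive type.
col_props <- prop.table(counts, margin = 2)
round(col_props, 2)
```

Which margin to use depends on the question: row proportions answer "within a class, how are drive types distributed?", whereas column proportions answer "within a drive type, how are classes distributed?".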
📋 See this vignette for more details on editing tables
📋 See R for Data Science - chapters 3 and 7 for more on data visualization in R.
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey