STA 101L - Summer I 2022

Raphael Morsomme

- Motivating example - stent and stroke
- Principles of statistical inference
- Types of variable
- Experiments and observational studies

Stents are known to reduce the risk of an additional heart attack or death after a cardiac event.

- Could stents have similar benefits for patients at risk of stroke?

- If so, we should use this well-known procedure to reduce the risk of stroke!
- If not, the procedure (surgery) should be avoided.

We have an experiment with 451 at-risk patients:

- each volunteer patient was randomly assigned to either the treatment (stent) or the control (no stent) group
- patients were checked 30 days and 365 days later

patient | group | 30 days | 365 days |
---|---|---|---|
1 | treatment | no event | no event |
2 | treatment | no event | no event |
3 | control | no event | no event |
4 | control | no event | no event |
5 | control | no event | no event |

Group | 30 days: stroke | 30 days: no event | 365 days: stroke | 365 days: no event |
---|---|---|---|---|
Control | 13 | 214 | 28 | 199 |
Treatment | 33 | 191 | 45 | 179 |
Total | 46 | 405 | 73 | 378 |

Contrary to expectation, we observe more strokes in the treatment group at both follow-ups.
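To make the comparison concrete, the counts in the summary table can be turned into stroke proportions per group. This is a minimal sketch: the dictionary layout and the helper name `stroke_proportion` are illustrative choices, and only the counts come from the table.

```python
# Counts copied from the summary table: (strokes, no events).
counts = {
    "30 days": {"control": (13, 214), "treatment": (33, 191)},
    "365 days": {"control": (28, 199), "treatment": (45, 179)},
}

def stroke_proportion(strokes, no_events):
    """Share of patients in a group who had a stroke."""
    return strokes / (strokes + no_events)

# Proportion of strokes for each follow-up period and each group.
proportions = {
    period: {group: stroke_proportion(*cells) for group, cells in groups.items()}
    for period, groups in counts.items()
}

for period, groups in proportions.items():
    diff = groups["treatment"] - groups["control"]
    print(f"{period}: control {groups['control']:.3f}, "
          f"treatment {groups['treatment']:.3f}, difference {diff:.3f}")
```

Comparing proportions rather than raw counts matters because the two groups do not have exactly the same size (227 control patients, 224 treatment patients).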

- Do the data show a *real* difference between the groups?
- Or is the difference simply due to chance?

This type of question is central to statistics.
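One way to probe the "due to chance" question is to simulate a world in which the stent makes no difference: reshuffle the 30-day outcomes among the patients many times and see how often a gap as large as the observed one appears. The sketch below is only an illustration of that idea, not necessarily the method used later in the course; the seed, the number of reshuffles, and the helper name `shuffled_diff` are arbitrary assumptions.

```python
import random

# 30-day outcomes from the summary table: 1 = stroke, 0 = no event.
treatment = [1] * 33 + [0] * 191   # 224 patients
control = [1] * 13 + [0] * 214     # 227 patients

observed_diff = sum(treatment) / len(treatment) - sum(control) / len(control)

def shuffled_diff(outcomes, n_treatment, rng):
    """Reassign outcomes to groups at random; return the difference in stroke proportions."""
    shuffled = outcomes[:]
    rng.shuffle(shuffled)
    new_treatment = shuffled[:n_treatment]
    new_control = shuffled[n_treatment:]
    return sum(new_treatment) / len(new_treatment) - sum(new_control) / len(new_control)

rng = random.Random(101)  # arbitrary seed, for reproducibility only
n_sims = 2000             # arbitrary number of reshuffles
all_outcomes = treatment + control
sim_diffs = [shuffled_diff(all_outcomes, len(treatment), rng) for _ in range(n_sims)]

# Fraction of reshuffles whose difference is at least as large as the observed one.
share_as_extreme = sum(d >= observed_diff for d in sim_diffs) / n_sims
print(f"observed difference: {observed_diff:.3f}")
print(f"share of reshuffles at least as large: {share_as_extreme:.4f}")
```

If that share is very small, chance alone is an unconvincing explanation for the observed difference.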

Suppose I flip a coin \(100\) times and count the number of times I obtain heads.

- I expect to observe *about* \(50\) heads.

- Imagine that I observe \(85\) heads instead. That would be alarming; the coin is probably not fair.

- If I had observed \(55\) heads then I would not be alarmed; this is a plausible result with a fair coin.
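The intuition in the coin example can be checked with a short calculation. Assuming the flips are independent and the coin is fair, the number of heads follows a binomial distribution, so the chance of seeing at least 55 or at least 85 heads can be computed exactly. The function name `prob_at_least` is an illustrative choice.

```python
from math import comb

def prob_at_least(k, n=100, p=0.5):
    """Probability of k or more heads in n independent flips with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_55 = prob_at_least(55)  # plausible outcome with a fair coin
p_85 = prob_at_least(85)  # extremely unlikely with a fair coin

print(f"P(at least 55 heads) = {p_55:.3f}")
print(f"P(at least 85 heads) = {p_85:.2e}")
```

A result that a fair coin would produce fairly often gives no reason for alarm; a result that a fair coin would almost never produce does.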

patient | group | 30 days | 365 days |
---|---|---|---|
1 | treatment | stroke | stroke |
2 | treatment | no event | no event |
3 | treatment | no event | no event |

- Each row represents an **observation**
- Each column represents a **variable**

**Observational units**: individuals, families, student cohort, cities, counties, countries, cells (biology), animals, books, courses, apples

**Variables**: height, weight, age, size, year, latitude, longitude, type, sex, diet, number of pages, genre, level, color

We are typically interested in the relation between variables in some population.

The **population of interest** is often large, but with well-defined limits

- e.g. patients at risk of stroke, Duke students, trees in Duke Forest, US counties
- but not the following: people, students, patients.

There are two ways to learn about the relation between variables in a given population.

**Census**: collect data on the whole population

- ideal
- …but typically impractical, expensive

**Sample**: small fraction of the population

- Population **parameter**, e.g. mean number of hours that Duke students sleep per night
  - Greek letter: \(\mu\), \(\beta\), but also \(p\).

- Sample **statistic**, e.g. *observed* average number of hours Duke students sleep per night in some sample
  - Roman letter: \(\bar{x}\), \(b\), \(\hat{p}\)

- How to learn about the population from a sample?
- …from sample statistics to population parameters

**Statistical inference** provides a rigorous framework to accomplish this.
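The distinction between a parameter and a statistic can be illustrated with a simulated population. Everything in this sketch is made up for illustration: the population size of 6,000, the average of 7 hours of sleep, the sample size of 100, and the seed.

```python
import random
from statistics import mean

rng = random.Random(2022)  # arbitrary seed, for reproducibility only

# Hypothetical population: nightly sleep hours of 6,000 students (made-up numbers).
population = [rng.gauss(7.0, 1.0) for _ in range(6000)]
mu = mean(population)                  # population parameter, usually unknown in practice

sample = rng.sample(population, 100)   # simple random sample of 100 students
x_bar = mean(sample)                   # sample statistic, what we actually observe

print(f"population mean mu = {mu:.2f}")
print(f"sample mean x_bar  = {x_bar:.2f}")
```

In a real study only `x_bar` is available; the goal of inference is to say something about `mu` from it.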

When you make soup, there is no need to drink the whole pot (population) to know if it is seasoned enough.

- Tasting a spoonful (sample) is sufficient.
- If the soup is well mixed, a spoonful is a **representative** sample of the population

Are all samples created equal? No!

What can go wrong?

- small samples,
- convenience sampling, e.g. students on campus,
- blind spots, e.g. voters with no phone,
- …

Sampling is an art.

The gold standard is a **random sample**

- but even then, we can have non-response bias

🛑 Obtaining a *representative* sample is difficult.

✅ But surprisingly small representative samples can do the job!

- e.g. 1,500 voters (later in class)
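The contrast between a random sample and a sample with a blind spot can be sketched with a simulated electorate. All numbers here are assumptions chosen for illustration: 200,000 voters, 80% of whom have a phone, with support for a candidate at 40% among phone owners and 80% among voters without a phone.

```python
import random

rng = random.Random(1500)  # arbitrary seed, for reproducibility only

def make_voter():
    """Return (has_phone, supports_candidate) for one hypothetical voter."""
    has_phone = rng.random() < 0.80
    supports = rng.random() < (0.40 if has_phone else 0.80)
    return has_phone, supports

population = [make_voter() for _ in range(200_000)]

def support_share(voters):
    """Proportion of voters who support the candidate."""
    return sum(supports for _, supports in voters) / len(voters)

true_p = support_share(population)

# Random sample of 1,500 voters from the whole population.
p_hat_random = support_share(rng.sample(population, 1500))

# Same sample size, but voters without a phone can never be reached.
phone_owners = [voter for voter in population if voter[0]]
p_hat_phone = support_share(rng.sample(phone_owners, 1500))

print(f"true support:        {true_p:.3f}")
print(f"random sample:       {p_hat_random:.3f}")
print(f"phone-only sample:   {p_hat_phone:.3f}")
```

In this setup the random sample of 1,500 lands close to the true proportion, while the phone-only sample misses it systematically, and increasing its size would not fix that.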

A **numerical** variable

- Takes a numerical value
- Examples: age, height, number of children

Numerical variables are either

- **discrete**, e.g. number of siblings, or
- **continuous**, e.g. a person’s height

- not always clear cut, e.g. GPA

A **categorical** variable

- Takes a level (a category)
- Examples: eye color, place of birth, education level

Categorical variables are either

- **nominal**, e.g. eye color, or
- **ordinal**, e.g. education level

Source: IMS

Two variables can either be **independent** or **associated**.

If two variables are associated, the association can be

- linear (positive, or negative)
- or it can take any form, e.g. U-shape, inverted-J-shape (like a square root)

- When two variables are associated, we sometimes hypothesize that changes in one *cause* changes in the other.

**Explanatory** variable \(\Rightarrow\) **response** variable

- …but association \(\neq\) causation; examples:
  - ice-cream and shark attacks; fire damage and firemen
  - counties and kidney cancer death rate; the best classrooms are small classrooms, but so are the worst classrooms.
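The ice-cream and shark-attack example can be mimicked with a small simulation in which a third variable, air temperature, drives both quantities while neither affects the other. The numbers are invented, and the Pearson correlation coefficient used to measure association is an added tool for this sketch rather than something defined in these slides.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(7)  # arbitrary seed, for reproducibility only

# Hypothetical daily data: temperature drives both variables independently.
temperature = [rng.uniform(10, 35) for _ in range(1000)]
ice_cream = [5 * t + rng.gauss(0, 10) for t in temperature]
shark_attacks = [0.2 * t + rng.gauss(0, 1) for t in temperature]

r_all = pearson_r(ice_cream, shark_attacks)

# Restrict to days with nearly the same temperature: the association mostly disappears.
band = [i for i, t in enumerate(temperature) if 20 <= t <= 22]
r_band = pearson_r([ice_cream[i] for i in band], [shark_attacks[i] for i in band])

print(f"all days:              r = {r_all:.2f}")
print(f"days between 20 and 22: r = {r_band:.2f}")
```

Ice-cream sales and shark attacks are strongly associated overall, yet by construction neither causes the other; comparing days of similar temperature removes most of the association.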

Why are most of the shaded counties in the middle of the country?

Source: Bayesian Data Analysis


In an **experiment**:

- The value of the explanatory variable is **assigned** by the researcher
- **Randomized** experiment: the value of the explanatory variable is randomly assigned
  - removes the influence of any confounding (lurking) variable, e.g. air temperature
- **Blind**, or even double-blind, to avoid biases
  - placebo
  - can go wrong, e.g. vitamins in prison
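Random assignment can be sketched in a few lines for the 451 patients of the stent study. The group sizes (224 treatment, 227 control) are read off the summary table; the patient IDs, the seed, and the shuffle-then-split procedure are assumptions for illustration, since the slides do not describe how the study actually randomized.

```python
import random

rng = random.Random(451)  # arbitrary seed, for reproducibility only

patients = list(range(1, 452))  # hypothetical patient IDs 1..451
shuffled = patients[:]
rng.shuffle(shuffled)           # every ordering of patients is equally likely

n_treatment = 224               # group size taken from the summary table
treatment = sorted(shuffled[:n_treatment])
control = sorted(shuffled[n_treatment:])

print(f"{len(treatment)} patients in treatment, {len(control)} in control")
print("first treatment IDs:", treatment[:5])
```

Because chance alone decides who gets the stent, patient characteristics such as age or health cannot systematically differ between the groups.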

🛑 We cannot always use experiments:

- not all variables can be assigned, e.g. age
- ethical considerations, e.g. smoking cigarettes, sham surgery (placebo)
- practical considerations, e.g. long-term consumption of red meat

✅ But when experiments can be implemented, they lead to **causal** claims and are therefore the gold standard.

In an **observational study**:

- The value of the explanatory variable is **not assigned** by the researcher
  - there is no interference
- Example: survey

🛑 Observational studies do not easily lead to causal claims because confounding variables may be present

Source: IMS

…but they can lead to causal claims in certain cases!

- E.g. smoking causes cancer.

- observations (row) and variables (column)
- population parameters and sample statistics
- statistical inference
- sampling
- four types of variables
  - numerical: continuous, discrete
  - categorical: nominal, ordinal

- experiments, observational studies and causal claims