Prediction project

Overview

In this prediction project, you will work in pairs to construct a regression model that will be used to make predictions. The prediction model is due on Tuesday, May 31, at 9:00 am, and the final report is due on Wednesday, June 1, at 9:00 pm. You can find your teammate here.

🍀 Good luck! 🍀

Introduction

The goal of the prediction project is for you to use regression analysis to construct a linear regression model with good prediction accuracy, demonstrating proficiency in the techniques we have covered in class so far and applying them to a real data set in a meaningful way.

All analyses must be done in RMarkdown, and all components of the project must be reproducible (with the exception of the presentation).

Academic Integrity

By participating in this project, you pledge to uphold the Duke Community Standard:

  • I will not lie, cheat, or steal in my academic endeavors;

  • I will conduct myself honorably in all my endeavors; and

  • I will act if the Standard is compromised.

Data

We will work with the natality data for the U.S. in 2020. The response variable that we are interested in is the newborn's birth weight. This variable is important to medical professionals since a newborn with a low birth weight is more likely to require additional care. You have access to a random sample of 1,000 observations to construct your prediction model. I have kept a separate random sample of 20,000 observations to evaluate the prediction model that you select.

Here is a detailed overview of the variables:

  • newborn_birth_weight: newborn birth weight in grams (response)
  • month: birth month (1 = January, …, 12 = December)
  • mother_age: age of the mother in years
  • prenatal_care_starting_month: month in which prenatal care began; if 0, there was no prenatal care
  • daily_cigarette_prepregnancy: daily number of cigarettes smoked before the pregnancy
  • daily_cigarette_trimester_1: daily number of cigarettes smoked during the 1st trimester of the pregnancy
  • daily_cigarette_trimester_2: daily number of cigarettes smoked during the 2nd trimester of the pregnancy
  • daily_cigarette_trimester_3: daily number of cigarettes smoked during the 3rd trimester of the pregnancy
  • mother_height: height of the mother in inches
  • mother_bmi: body mass index of the mother
  • mother_weight_prepregnancy: weight of the mother before the pregnancy in pounds
  • mother_weight_delivery: weight of the mother at delivery in pounds
  • mother_diabetes_gestational: whether the mother had diabetes during the pregnancy
  • newborn_sex: sex of the newborn
  • gestation_week: number of gestational weeks
  • mother_risk_factors: whether the mother had any risk factor (diabetes, hypertension, previous preterm birth, previous cesarean, infertility treatment, etc.)
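
Several of the variables above lend themselves to feature engineering. The following is only a sketch under stated assumptions: the data frame `births` is simulated so that the code runs on its own (it is not the project sample), and the new column names (`month_factor`, `no_prenatal_care`, `cigarettes_pregnancy`, `weight_gain`, `log_cigarettes`) are hypothetical choices, not required ones.

```r
# Simulated stand-in for the project sample; the values are NOT real natality data.
set.seed(1)
n <- 200
births <- data.frame(
  month = sample(1:12, n, replace = TRUE),
  prenatal_care_starting_month = sample(0:9, n, replace = TRUE),
  daily_cigarette_trimester_1 = rpois(n, 0.5),
  daily_cigarette_trimester_2 = rpois(n, 0.4),
  daily_cigarette_trimester_3 = rpois(n, 0.3),
  mother_weight_prepregnancy = round(rnorm(n, 160, 30)),
  mother_weight_delivery = round(rnorm(n, 190, 30))
)

# Categorical predictor: month is coded 1-12, so it can be treated as a factor.
births$month_factor <- factor(births$month, levels = 1:12, labels = month.abb)

# Indicator: 0 means "no prenatal care", which is not a point on the month scale.
births$no_prenatal_care <- births$prenatal_care_starting_month == 0

# Combination: sum of the three daily cigarette counts reported during pregnancy.
births$cigarettes_pregnancy <- with(
  births,
  daily_cigarette_trimester_1 + daily_cigarette_trimester_2 +
    daily_cigarette_trimester_3
)

# Combination: weight gained between pre-pregnancy and delivery, in pounds.
births$weight_gain <- births$mother_weight_delivery -
  births$mother_weight_prepregnancy

# Transformation: log(1 + x) is defined at zero, which matters for count data.
births$log_cigarettes <- log1p(births$cigarettes_pregnancy)

str(births)
```

Whether any of these new variables actually helps prediction is something to check with visualization and model comparison rather than assume.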

Submission

The three primary deliverables for this project are:

  • Prediction model: you need to submit a .RDATA file that contains your prediction model. The prediction model must be a model fitted using the lm command in R. You may use only the 1,000 observations provided here to fit your model. The .RDATA file should not contain anything else.

  • Informal presentation: you will present your work orally to the rest of the class. The presentation must not exceed 5 minutes (aim for 2-6 slides); it is fine if it is shorter. The two team members should speak for roughly the same amount of time. Each presentation will be followed by a short Q&A session.

  • Final report: the final report details your work. It needs to be written in RMarkdown and submitted on Gradescope as a PDF. The RMD file also needs to be sent via email to the instruction team so that the analysis can be reproduced. The page limit is 6 pages (including code chunks, but excluding the appendix). Figures should go in the appendix, along with any other work that you wish to include. Grading will largely be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report.
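
For the prediction model deliverable, one possible way to produce a file that holds a single lm object is sketched below. This is an illustration under assumptions, not a prescribed workflow: the data are simulated so the code runs on its own, and the object name `final_model` and file name `prediction_model.RDATA` are hypothetical.

```r
# Simulated stand-in data; the values are NOT real natality data.
set.seed(1)
births <- data.frame(gestation_week = rnorm(100, mean = 39, sd = 2))
births$newborn_birth_weight <- -2500 + 150 * births$gestation_week +
  rnorm(100, mean = 0, sd = 400)

# Fit the model with lm, as the project requires.
final_model <- lm(newborn_birth_weight ~ gestation_week, data = births)

# Save only the model object by naming it explicitly.
# (save.image() would instead store every object in the workspace.)
model_file <- file.path(tempdir(), "prediction_model.RDATA")
save(final_model, file = model_file)

# Check the file by loading it into a fresh, empty environment.
check_env <- new.env()
loaded <- load(model_file, envir = check_env)
print(loaded)                          # names of the objects stored in the file
print(class(check_env$final_model))   # class of the stored model
```

Loading into a separate environment makes it easy to see exactly which objects the file contains before submitting it.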

Final report content

The final report should include the following sections, though you should feel free to include additional sections as necessary.

  • Short introduction: briefly mention the variables you have considered, those you have engineered, and the approach you have chosen for model selection.
  • Variable selection and engineering: list the variables that you have chosen to consider; briefly explain your reasoning; describe new variables that you have engineered; visualize every variable that you create.
  • Outliers: if there were outliers, explain how you treated them.
  • Model fitting: you need to fit at least 5 models:
    • a simple linear regression model with gestation_week as the predictor
    • the full model with all 15 raw predictors
    • at least 3 other models containing new variables that you have created
  • Model selection: select the model that you will submit using
    • a model selection criterion,
    • the holdout method, and
    • cross-validation
  • Discussion/conclusion: briefly discuss your results, any limitations of your work, and what predictor(s) you would have liked to have in the dataset (e.g. weight of the father) to make the prediction model more accurate.
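
The three model selection approaches listed above can be sketched as follows. This is a minimal illustration, not a required implementation: the data are simulated so the code runs on its own, only three candidate models are shown instead of five, and the choices of AIC as the criterion, RMSE as the error measure, an 80/20 holdout split, and 5 folds are all assumptions. The helper `rmse` and the variable `weight_gain` are hypothetical names.

```r
# Simulated stand-in data; the values are NOT real natality data.
set.seed(2022)
n <- 300
births <- data.frame(
  gestation_week = round(rnorm(n, 39, 2)),
  mother_age = round(runif(n, 18, 42)),
  mother_weight_prepregnancy = rnorm(n, 160, 30),
  mother_weight_delivery = rnorm(n, 190, 30)
)
births$weight_gain <- births$mother_weight_delivery -
  births$mother_weight_prepregnancy
births$newborn_birth_weight <- -2600 + 150 * births$gestation_week +
  3 * births$weight_gain + rnorm(n, 0, 400)

# Candidate models, stored as formulas so each can be refitted on any subset.
candidates <- list(
  simple = newborn_birth_weight ~ gestation_week,
  raw = newborn_birth_weight ~ gestation_week + mother_age +
    mother_weight_prepregnancy + mother_weight_delivery,
  engineered = newborn_birth_weight ~ gestation_week + weight_gain
)

# Root mean squared prediction error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 1. Selection criterion computed on the entire sample (lower AIC is better).
aic <- sapply(candidates, function(f) AIC(lm(f, data = births)))

# 2. Holdout method: fit on 80% of the rows, evaluate on the remaining 20%.
train_rows <- sample(n, size = round(0.8 * n))
holdout_rmse <- sapply(candidates, function(f) {
  fit <- lm(f, data = births[train_rows, ])
  test <- births[-train_rows, ]
  rmse(test$newborn_birth_weight, predict(fit, newdata = test))
})

# 3. Cross-validation: each of the k folds serves once as the test set.
k <- 5
fold <- sample(rep(1:k, length.out = n))
cv_rmse <- sapply(candidates, function(f) {
  fold_rmse <- sapply(1:k, function(i) {
    fit <- lm(f, data = births[fold != i, ])
    test <- births[fold == i, ]
    rmse(test$newborn_birth_weight, predict(fit, newdata = test))
  })
  mean(fold_rmse)
})

comparison <- data.frame(aic, holdout_rmse, cv_rmse)
print(comparison)
```

Storing the candidates as formulas keeps the three comparisons consistent, since every approach refits exactly the same set of models.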

Rubric (20 points)

Part: Points

Prediction model: 3

  • The RDATA file contains only the selected model: 1

  • The model is fitted using the lm command in R: 1

  • The model is fitted using only the sample provided: 1

  • Most accurate model of the class: bragging rights

Report: 16

  • The analysis is fully reproducible: 1

  • The simple model with gestation_week is fitted: 1

  • The full model is fitted: 1

  • At least one new predictor is created via transformation: 1

  • At least one new predictor is created via combination: 1

  • At least one interaction is considered: 1

  • At least one categorical predictor is considered: 1

  • Subject knowledge is used to guide variable selection and feature engineering: 2

  • Data visualization is used to guide variable selection and feature engineering: 2

  • All figures are useful and neatly formatted: 1

  • Outliers, if any, are handled appropriately: 1

  • At least 5 models (including the simple and full models) are compared using (3 in total):

    • the entire sample with a suitable criterion: 1

    • a test set with the holdout method: 1

    • a test set with cross-validation: 1

Presentation: 1

  • You ask at least one question during Q&A: 1

Total: 20