Chapter 8: Statistical Models for Prognostication: Central Concepts in Predictive Modeling
        

Overfitting

Overfitting may be labeled the "curse of predictive modeling." It refers to the phenomenon in which a predictive model describes the relationship between predictors and outcome well in the patients used to develop the model, but subsequently fails to provide valid predictions in new patients. The model shows an adequate fit in the data set under study but does not validate; that is, it does not provide accurate predictions for observations from a new data set.

QUESTION 6.1

Which of the following does not contribute to overfitting?

A. A small sample size
B. Stepwise selection procedures to identify only statistically significant predictors
C. Optimal recoding procedures based on thorough analysis of patterns in the data set
D. Reduction of regression coefficients with shrinkage methods


What causes overfitting? First, the data set under study is only a sample from the underlying population. Our aim is not to describe the sample as completely and accurately as we can, but to provide a prognostic model that provides valid predictions for the underlying population. In principle, the sample offers us the opportunity to learn about the population. However, when the sample size is relatively small, there will be considerable uncertainty in what we learn.
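The gap between fit in the sample and validity in the population can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings that do not come from this chapter: 40 development patients, 15 candidate predictors of which only the first two truly affect the outcome, a continuous outcome, and ordinary least squares. All function names are hypothetical.

```python
import random


def solve_least_squares(X, y):
    """Ordinary least squares via the normal equations and Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry onto the diagonal
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for row in range(p):
            if row != col:
                factor = A[row][col]
                A[row] = [a - factor * b for a, b in zip(A[row], A[col])]
    return [A[i][p] for i in range(p)]


def simulate_patients(n, n_predictors, rng):
    """Assumed truth: the outcome depends on the first two predictors only."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(n_predictors)]
        X.append([1.0] + x)  # the leading 1.0 is the intercept column
        y.append(0.5 * x[0] + 0.5 * x[1] + rng.gauss(0, 1))
    return X, y


def r_squared(X, y, coef):
    """Proportion of outcome variance explained by the predictions."""
    preds = [sum(c * v for c, v in zip(coef, row)) for row in X]
    mean_y = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


def apparent_vs_validated(seed=1):
    rng = random.Random(seed)
    X_dev, y_dev = simulate_patients(40, 15, rng)    # small development sample
    X_new, y_new = simulate_patients(2000, 15, rng)  # new patients, same population
    coef = solve_least_squares(X_dev, y_dev)
    return r_squared(X_dev, y_dev, coef), r_squared(X_new, y_new, coef)


apparent, validated = apparent_vs_validated()
print(f"R^2 in development patients: {apparent:.2f}")
print(f"R^2 in new patients:         {validated:.2f}")
```

In runs of this kind the R² in the development patients is typically well above the R² in new patients. Increasing the development sample size in `apparent_vs_validated` should narrow the gap.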

Second, several statistical procedures may work well in data sets of infinite size, but may ask too much of the data when the sample size is relatively small. These procedures include:

  • Stepwise selection methods, which lead to the identification of predictors that are most important in the data set under study, and
  • Extensive assessments of the validity of the model assumptions (linearity, additivity, distributional) with iterative adjustment of the model structure.

Elsewhere in this chapter, we further discuss stepwise selection methods (see Development of Regression Models: Selection of Covariables) and the linearity, additivity, and distributional model assumptions (see Theoretical Aspects of Predictive Modeling).

These statistical procedures lead to bias in the estimated regression coefficients because they overemphasize idiosyncratic patterns in the data set under study.
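The bias introduced by selecting on statistical significance can be sketched in a simulation. The example below makes several assumptions that are not part of this chapter: a single predictor with a modest true coefficient of 0.2, 20 patients per data set, and a crude selection rule that keeps the predictor only when the absolute ratio of coefficient to standard error exceeds 1.96. The function names are hypothetical.

```python
import random
import statistics


def estimate_slope(n, true_beta, rng):
    """Simple linear regression of y on x; returns the slope and its standard error."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_beta * xi + rng.gauss(0, 1) for xi in x]
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se


def selection_bias(n=20, true_beta=0.2, n_studies=5000, seed=1):
    """Compare the mean coefficient over all data sets with the mean over
    only those data sets in which the predictor was 'significant'."""
    rng = random.Random(seed)
    all_estimates, selected = [], []
    for _ in range(n_studies):
        slope, se = estimate_slope(n, true_beta, rng)
        all_estimates.append(slope)
        if abs(slope / se) > 1.96:  # assumed selection rule (normal approximation)
            selected.append(slope)
    return (statistics.fmean(all_estimates),
            statistics.fmean(selected),
            len(selected) / n_studies)


mean_all, mean_selected, fraction_selected = selection_bias()
print(f"True coefficient:                 0.20")
print(f"Mean estimate, all data sets:     {mean_all:.2f}")
print(f"Mean estimate, selected only:     {mean_selected:.2f}")
print(f"Fraction of data sets selected:   {fraction_selected:.2f}")
```

Averaged over all simulated data sets the coefficient is close to its true value, but among the data sets in which the predictor passes the selection rule the average coefficient is clearly larger than the truth. A predictor is selected precisely when its effect happens to be overestimated.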

Third, statistical research has shown that predictions are too extreme (high predictions too high and low predictions too low) when regression models are fitted to a data set, even when the model was fully specified without studying the data (Copas, 1983; Van Houwelingen and Le Cessie, 1990). Although the regression coefficients are nearly unbiased estimates in this case, the model benefits from some reduction of the regression coefficients ("shrinkage") to provide reliable predictions (for more on shrinkage, see Development of Regression Models: Estimation of Regression Coefficients).
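That predictions are too extreme even for a prespecified model can be checked with a simple simulation. The sketch below assumes a regression model with one prespecified categorical predictor with eight levels, for which the least-squares predictions are simply the group means in the development sample. The true group means, the sample sizes, and the function names are hypothetical choices for illustration. Extremeness is summarized by the calibration slope, that is, the slope from regressing outcomes in new patients on the predictions; a slope below 1 indicates predictions that are too extreme.

```python
import random

# Hypothetical true mean outcome in each of eight predictor categories
TRUE_GROUP_MEANS = [-0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7]


def slope_of(y, x):
    """Slope from a simple linear regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def calibration_slope(rng, n_dev_per_group=5, n_new_per_group=200):
    # Fit: with one categorical predictor, least-squares predictions
    # equal the observed group means in the development sample.
    predicted = [
        sum(rng.gauss(mu, 1) for _ in range(n_dev_per_group)) / n_dev_per_group
        for mu in TRUE_GROUP_MEANS
    ]
    # Validate: draw new patients from the same population.
    preds, outcomes = [], []
    for mu, pred in zip(TRUE_GROUP_MEANS, predicted):
        for _ in range(n_new_per_group):
            preds.append(pred)
            outcomes.append(rng.gauss(mu, 1))
    return slope_of(outcomes, preds)


def mean_calibration_slope(n_repeats=200, seed=1):
    """Average the calibration slope over repeated development samples."""
    rng = random.Random(seed)
    slopes = [calibration_slope(rng) for _ in range(n_repeats)]
    return sum(slopes) / len(slopes)


print(f"Average calibration slope: {mean_calibration_slope():.2f}")
```

Under these assumptions the average calibration slope falls well below 1, although each group mean is an unbiased estimate. Pulling the predictions toward the overall mean by roughly this factor is the idea behind shrinkage of the regression coefficients.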

 
