Overfitting may be labeled the "curse of predictive modeling."
Overfitting refers to the phenomenon in which a predictive model
describes the relationship between predictors and outcome well in
the patients used to develop the model, but subsequently fails to
provide valid predictions in new patients. The model shows an
adequate fit in the data set under study, but does not validate,
that is, it does not provide accurate predictions for observations
from a new data set.
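The gap between fit in the development data and accuracy in new data
can be illustrated with a small simulation. The sketch below is an
assumption-laden toy example, not an analysis from this chapter: it
uses a linear model with a continuous outcome, 15 simulated
predictors of which only 3 have a true effect, and a development
sample of 40 "patients". The function name and all parameter values
are hypothetical.

```python
import numpy as np


def r_squared(y, y_hat):
    """Proportion of outcome variance explained by the predictions."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def simulate_overfitting(n_dev=40, n_new=10000, n_predictors=15, seed=0):
    """Fit a linear model in a small sample; evaluate it in new data.

    Returns (apparent R^2 in development data, R^2 in new data).
    All settings are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros(n_predictors)
    beta[:3] = 0.5  # only three predictors truly affect the outcome

    def draw(n):
        X = rng.normal(size=(n, n_predictors))
        y = X @ beta + rng.normal(size=n)
        return X, y

    X_dev, y_dev = draw(n_dev)   # small development sample
    X_new, y_new = draw(n_new)   # large sample of "new patients"

    # Ordinary least squares with an intercept, fitted on development data
    A_dev = np.column_stack([np.ones(n_dev), X_dev])
    coef, *_ = np.linalg.lstsq(A_dev, y_dev, rcond=None)

    A_new = np.column_stack([np.ones(n_new), X_new])
    return r_squared(y_dev, A_dev @ coef), r_squared(y_new, A_new @ coef)


if __name__ == "__main__":
    apparent, validated = simulate_overfitting()
    print(f"apparent R^2:  {apparent:.2f}")
    print(f"validated R^2: {validated:.2f}")
```

Under these assumptions the apparent fit is noticeably better than
the fit in new data, which is the pattern described above.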
QUESTION 6.1
What issues do not contribute to overfitting?
What causes overfitting? First, the data set under study is only
a sample from the underlying population. Our aim is not to describe
the sample as completely and accurately as we can, but to provide
a prognostic model that provides valid predictions for the underlying
population. In principle, the sample offers us the opportunity
to learn about the population. However, when the sample size
is relatively small, there will be considerable uncertainty
in what we learn.
Second, several statistical procedures may work well in data sets of
infinite size, but may ask too much of the data when the size
is relatively small. These procedures include:
- Stepwise selection methods, which lead to the identification of
  predictors that are most important in the data set under study, and
- Extensive assessments of the validity of the model assumptions
  (linearity, additivity, distributional) with iterative adjustment
  of the model structure.
Elsewhere in this chapter, we further discuss stepwise selection
methods (see Development of Regression Models: Selection of
Covariables), and linearity, additivity, and distributional model
assumptions (see Theoretical Aspects of Predictive Modeling).
These statistical procedures bias the estimated regression
coefficients by overstressing idiosyncratic patterns in the data.
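One way to see this bias is to look at what happens to a single
coefficient when it is only reported if it passes a significance
test. The sketch below is a deliberately simplified stand-in for
stepwise selection (one predictor, one test at |z| > 1.96), not a
full stepwise procedure; the true effect of 0.2, the sample size of
50, and the function name are hypothetical choices for illustration.

```python
import numpy as np


def selection_bias(true_beta=0.2, n=50, n_sim=5000, z_crit=1.96, seed=1):
    """Compare coefficient estimates with and without selection.

    Returns (mean estimate over all samples,
             mean estimate over samples where the predictor was
             "selected" because |estimate / SE| > z_crit,
             fraction of samples in which it was selected).
    """
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_sim)
    selected = np.zeros(n_sim, dtype=bool)

    for i in range(n_sim):
        x = rng.normal(size=n)
        y = true_beta * x + rng.normal(size=n)

        # Simple linear regression of y on x (with intercept)
        xc = x - x.mean()
        yc = y - y.mean()
        sxx = np.sum(xc ** 2)
        b = np.sum(xc * yc) / sxx
        resid = yc - b * xc
        se = np.sqrt(np.sum(resid ** 2) / (n - 2) / sxx)

        estimates[i] = b
        selected[i] = abs(b / se) > z_crit

    return estimates.mean(), estimates[selected].mean(), selected.mean()


if __name__ == "__main__":
    mean_all, mean_selected, frac = selection_bias()
    print(f"mean estimate, all samples:      {mean_all:.2f}")
    print(f"mean estimate, selected samples: {mean_selected:.2f}")
    print(f"fraction selected:               {frac:.2f}")
```

Across all simulated samples the estimate is close to the true value,
but among the samples in which the predictor is selected the average
estimate is clearly larger: selection favors the samples in which the
effect happened to look strong.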
Third, more technical statistical research has shown that
predictions are too extreme (high predictions too high and low
predictions too low) when regression models are fitted to a data
set, even when the model was fully specified without studying the
data (Copas, 1983; Van Houwelingen and Le Cessie, 1990). Although
the regression coefficients are nearly unbiased estimates in this
case, the model benefits from some reduction of the regression
coefficients ("shrinkage") to provide reliable predictions (for more
on shrinkage, see Development of Regression Models: Estimation of
Regression Coefficients).
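The "too extreme" pattern can be made visible by regressing the
observed outcome in new patients on the predictions: a slope below 1
means that high predictions were too high and low predictions too
low. The sketch below is an illustrative simulation only, assuming a
linear model with 10 prespecified predictors that all have a true
effect, and a development sample of 40; it is not the method of the
cited papers, and the function name and settings are hypothetical.
In practice new data are usually not available, so a shrinkage factor
has to be estimated from the development data itself.

```python
import numpy as np


def mean_calibration_slope(n_dev=40, n_new=2000, n_predictors=10,
                           true_beta=0.3, n_sim=200, seed=2):
    """Average calibration slope of a prespecified model in new data.

    For each simulated development sample, a full model (no selection)
    is fitted by ordinary least squares, and the outcome in a large new
    sample is regressed on the model's predictions. A slope below 1
    indicates predictions that are too extreme.
    """
    rng = np.random.default_rng(seed)
    beta = np.full(n_predictors, true_beta)
    slopes = np.empty(n_sim)

    for i in range(n_sim):
        # Development sample and prespecified full model
        X_dev = rng.normal(size=(n_dev, n_predictors))
        y_dev = X_dev @ beta + rng.normal(size=n_dev)
        A_dev = np.column_stack([np.ones(n_dev), X_dev])
        coef, *_ = np.linalg.lstsq(A_dev, y_dev, rcond=None)

        # New patients from the same underlying population
        X_new = rng.normal(size=(n_new, n_predictors))
        y_new = X_new @ beta + rng.normal(size=n_new)
        pred = coef[0] + X_new @ coef[1:]

        # Slope of the regression of observed outcome on prediction
        slopes[i] = np.cov(pred, y_new)[0, 1] / np.var(pred, ddof=1)

    return slopes.mean()


if __name__ == "__main__":
    slope = mean_calibration_slope()
    print(f"mean calibration slope in new data: {slope:.2f}")
```

Multiplying the fitted coefficients by a factor below 1 pulls the
predictions toward the average, which is the idea behind shrinkage.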