Chapter 8: Statistical Models for Prognostication: Development of Regression Models
        

Coding of Covariables

The coding of both categorical and continuous covariables merits attention in the development of a predictive model.

Categorical covariables may have just 2 values (i.e. binary), e.g. gender (male/female) or presence of a risk factor (yes/no), in which case the coding is easy. Categorical covariables may also have more than 2 values, either without ordering (nominal) or with ordering (ordinal). Such covariables may be coded as factor variables, which implies that a reference category is chosen and that the other values are contrasted against this category by "dummy" variables.

Dummy variables indicate whether a value is in a certain category or not. For example, a three-category variable, such as smoking (current, ex-smoker, non-smoker), might be coded with two dummy variables, one indicating whether the patient is a current smoker and the other indicating whether the patient is an ex-smoker, with non-smokers as the reference category. Technically the coding can be represented as follows:

              Current   Exsmk
Non-smoker       0        0
Ex-smoker        0        1
Current          1        0
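
The coding scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `dummy_code` and the sample data are assumptions made for the example, and only the standard library is used.

```python
def dummy_code(values, reference):
    """Build 0/1 dummy variables for a categorical covariable.

    Every category except `reference` gets its own indicator column;
    the reference category is represented by zeros in all columns.
    """
    # Collect the non-reference categories in a stable (sorted) order.
    levels = sorted({v for v in values if v != reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical sample of five patients.
smoking = ["non-smoker", "ex-smoker", "current", "non-smoker", "current"]
dummies = dummy_code(smoking, reference="non-smoker")

for name, column in dummies.items():
    print(name, column)
```

A three-category covariable yields two columns here, matching the "Current" and "Exsmk" columns of the table; a non-smoker is identified by a 0 in both.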

For the analysis, we might also combine the ex-smokers and current smokers in a single category. This collapsing might be based on findings in other studies or on the observed frequency distribution (e.g. few ex-smokers). Because such a decision does not depend on the observed relation with the outcome, it does not lead to bias in the estimated coefficient.

The decision to collapse may also be based on the observed regression coefficients for "current" and "exsmk" (in either univariable or multivariable analysis). The resulting regression coefficient is then no longer unbiased, since the decision to collapse was not taken independently of the data used to estimate it. (For more about the philosophy of such aspects, see Problems with Regression Models: Model Uncertainty.)
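
As a small sketch of the first situation, collapsing can be written as a fixed mapping that is specified before the outcome data are examined. The mapping, the label "smoker", and the sample data below are assumptions for illustration only.

```python
# Hypothetical mapping, chosen in advance (e.g. because ex-smokers are rare),
# not on the basis of the estimated regression coefficients.
COLLAPSE_MAP = {
    "non-smoker": "non-smoker",
    "ex-smoker": "smoker",
    "current": "smoker",
}


def collapse(values, mapping=COLLAPSE_MAP):
    """Recode each category according to the pre-specified mapping."""
    return [mapping[v] for v in values]


smoking = ["non-smoker", "ex-smoker", "current", "non-smoker", "current"]
collapsed = collapse(smoking)

# After collapsing, a single dummy variable is enough.
smoker = [1 if v == "smoker" else 0 for v in collapsed]
print(smoker)
```

The three-category covariable now needs one dummy variable instead of two, with non-smokers still serving as the reference category.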

Continuous covariables are frequently coded as linear terms in a regression model. This linearity may be tested as described before (see: Theoretical Aspects of Predictive Modeling: Linearity Assumption). Alternatively, one might specify a certain transformation beforehand, based on biological reasoning or prior knowledge.

Another possibility would be to include flexible functions with a pre-specified number of degrees of freedom, irrespective of the statistical significance of the non-linear terms. For example, if age is known to be an important predictor, one might specify that a restricted cubic spline function is fitted with 4 knots (Harrell et al., 1988). This function has 3 degrees of freedom (df): one linear term and two non-linear terms. This implies that the fitted curve has the flexibility to bend twice, as illustrated in the graph.
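
To make the degrees of freedom concrete, the sketch below builds the terms of a restricted cubic spline for a single value, using the commonly cited truncated power formulation. The function name `rcs_basis` and the knot positions for age are assumptions for illustration; software implementations often rescale the non-linear terms, which changes the coefficients but not the fitted curve.

```python
def rcs_basis(x, knots):
    """Return the restricted cubic spline terms for one value x.

    With k knots the result has k - 1 terms: x itself plus k - 2
    non-linear terms. The terms are zero below the first knot and
    linear in x beyond the last knot, which is the "restriction".
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def pos_cube(u):
        # Truncated cubic: u^3 if u is positive, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    span = t[-1] - t[-2]  # distance between the last two knots
    terms = [float(x)]
    for j in range(k - 2):
        terms.append(
            pos_cube(x - t[j])
            - pos_cube(x - t[-2]) * (t[-1] - t[j]) / span
            + pos_cube(x - t[-1]) * (t[-2] - t[j]) / span
        )
    return terms


# Hypothetical knot positions for age, in years.
age_knots = [30, 45, 60, 75]
print(len(rcs_basis(50, age_knots)))  # number of terms, i.e. the df
```

With 4 knots each patient contributes 3 terms to the regression model, so 3 coefficients are estimated for age regardless of whether the non-linear terms turn out to be statistically significant.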

Figure 7.1: Restricted Cubic Splines
Illustration of 3 restricted cubic spline functions with 4 knots (3 degrees of freedom). The lines with markers represent the fitted functions, which closely follow the underlying mathematically defined functions.

 

 
