Missing covariate data in medical research: to impute is better than to ignore

Kristel J M Janssen; A Rogier T Donders; Frank E Harrell Jr; Yvonne Vergouwe; Qingxia Chen; Diederick E Grobbee; Karel G M Moons

doi:10.1016/j.jclinepi.2009.12.008

J Clin Epidemiol. 2010 Jul;63(7):721-7. doi: 10.1016/j.jclinepi.2009.12.008. Epub 2010 Mar 24.

Authors

Kristel J M Janssen¹, A Rogier T Donders, Frank E Harrell Jr, Yvonne Vergouwe, Qingxia Chen, Diederick E Grobbee, Karel G M Moons

Affiliation

¹ Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands. k.j.m.janssen@umcutrecht.nl

PMID: 20338724

Abstract

Objective: We compared popular methods for handling missing data with multiple imputation, a more sophisticated method that preserves the observed data.

Study design and setting: We used data from 804 patients with suspected deep venous thrombosis (DVT). We studied three covariates to predict the presence of DVT: d-dimer level, difference in calf circumference, and history of leg trauma. We introduced missing values (missing at random) ranging from 10% to 90%. The risk of DVT was modeled with logistic regression under each of three methods: complete case analysis, exclusion of d-dimer level from the model, and multiple imputation.
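
The design above can be illustrated with a minimal sketch, which is not the authors' code. It generates synthetic patients (all distributions and numeric values are hypothetical, not the study data), assumes the missing values were introduced in d-dimer level, and makes the probability of missingness depend only on the fully observed calf difference, which is one way to obtain a missing-at-random mechanism. It then builds the data sets used by the two simple methods. The DVT outcome, the logistic regression, and the imputation step are omitted.

```python
import math
import random

rng = random.Random(2010)  # fixed seed so the sketch is reproducible

def make_synthetic_patients(n):
    """Generate n synthetic patients (hypothetical values, NOT the study data)."""
    patients = []
    for _ in range(n):
        calf_diff = rng.gauss(2.0, 1.5)                  # hypothetical calf difference, cm
        trauma = 1 if rng.random() < 0.15 else 0         # hypothetical prevalence
        d_dimer = rng.gauss(1.0 + 0.3 * calf_diff, 1.0)  # hypothetical scale
        patients.append({"d_dimer": d_dimer, "calf_diff": calf_diff, "trauma": trauma})
    return patients

def introduce_mar(patients, target_rate):
    """Blank out d_dimer with a probability that depends only on calf_diff.

    calf_diff stays fully observed, so the mechanism is missing at random.
    """
    weights = [1.0 / (1.0 + math.exp(-(p["calf_diff"] - 2.0))) for p in patients]
    mean_weight = sum(weights) / len(weights)
    result = []
    for p, w in zip(patients, weights):
        # Averages to target_rate over the sample unless a probability is clipped at 1.
        prob = min(1.0, target_rate * w / mean_weight)
        d_dimer = None if rng.random() < prob else p["d_dimer"]
        result.append({**p, "d_dimer": d_dimer})
    return result

patients = make_synthetic_patients(804)
with_missing = introduce_mar(patients, target_rate=0.30)

# Method 1, complete case analysis: keep only rows where d_dimer was observed.
complete_cases = [p for p in with_missing if p["d_dimer"] is not None]

# Method 2, dropping the covariate: keep every row but omit d_dimer entirely.
dropped_covariate = [{k: v for k, v in p.items() if k != "d_dimer"} for p in with_missing]

missing_rate = 1 - len(complete_cases) / len(with_missing)
print(f"missing d_dimer: {missing_rate:.1%}; complete cases: {len(complete_cases)}")
```

Complete case analysis discards the observed calf difference and trauma history of every patient whose d-dimer is missing, whereas dropping the covariate keeps all patients but loses the d-dimer information altogether.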

Results: Multiple imputation showed less bias in the regression coefficients of the three variables and more accurate coverage of the corresponding 90% confidence intervals than complete case analysis and dropping d-dimer level from the analysis. Multiple imputation showed unbiased estimates of the area under the receiver operating characteristic curve (0.88) compared with complete case analysis (0.77) and when the variable with missing values was dropped (0.65).
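
The abstract does not say how the estimates from the imputed data sets were combined. A common choice is Rubin's rules, which are assumed in the sketch below: the pooled coefficient is the mean across imputations, and its variance adds the average within-imputation variance to an inflated between-imputation variance. The input numbers are hypothetical, and the 90% interval uses a normal approximation rather than the t reference distribution that a full implementation would typically use.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets with Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)       # mean of the m point estimates
    within = statistics.fmean(variances)       # average squared standard error
    between = statistics.variance(estimates)   # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between     # total variance of the pooled estimate
    return pooled, math.sqrt(total)

# Hypothetical coefficient estimates and squared standard errors from m = 5 imputations.
estimates = [1.10, 1.25, 0.95, 1.20, 1.00]
variances = [0.040, 0.045, 0.038, 0.042, 0.041]

pooled, pooled_se = pool_rubin(estimates, variances)

# 90% interval under a normal approximation (an assumption of this sketch).
z = statistics.NormalDist().inv_cdf(0.95)
ci_low, ci_high = pooled - z * pooled_se, pooled + z * pooled_se
print(f"pooled = {pooled:.3f}, SE = {pooled_se:.3f}, 90% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The between-imputation term is what lets the interval reflect the extra uncertainty caused by the missing values, which is relevant to the coverage comparison reported above.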

Conclusion: Because this study shows that simple methods for dealing with missing data can lead to seriously misleading results, we advise considering multiple imputation. The purpose of multiple imputation is not to create data, but to prevent the exclusion of observed data.

Publication types

Comparative Study

MeSH terms

Adult
Cross-Sectional Studies
Fibrin Fibrinogen Degradation Products / metabolism
Humans
Research Design / statistics & numerical data*
Statistics as Topic / methods*
Venous Thrombosis / diagnosis*
Venous Thrombosis / etiology

Substances

Fibrin Fibrinogen Degradation Products
fibrin fragment D