Calibration: the Achilles heel of predictive analytics

Ben Van Calster; David J McLernon; Maarten van Smeden; Laure Wynants; Ewout W Steyerberg; Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative

doi:10.1186/s12916-019-1466-7

BMC Med. 2019 Dec 16;17(1):230. doi: 10.1186/s12916-019-1466-7.

Authors

Ben Van Calster^{1,2,3}, David J McLernon^{4,5}, Maarten van Smeden^{6,7,5}, Laure Wynants^{8,9}, Ewout W Steyerberg^{6,5}; Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative

Collaborators

Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative:
Patrick Bossuyt, Gary S Collins, Petra Macaskill, David J McLernon, Karel G M Moons, Ewout W Steyerberg, Ben Van Calster, Maarten van Smeden, Andrew J Vickers

Affiliations

¹ Department of Development and Regeneration, KU Leuven, Herestraat 49 box 805, 3000, Leuven, Belgium. ben.vancalster@kuleuven.be.
² Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands. ben.vancalster@kuleuven.be.
³ , . ben.vancalster@kuleuven.be.
⁴ Medical Statistics Team, Institute of Applied Health Sciences, School of Medicine, Medical Sciences and Nutrition, University of Aberdeen, Aberdeen, UK.
⁵ .
⁶ Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands.
⁷ Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, Netherlands.
⁸ Department of Development and Regeneration, KU Leuven, Herestraat 49 box 805, 3000, Leuven, Belgium.
⁹ Department of Epidemiology, CAPHRI Care and Public Health Research Institute, Maastricht University, Maastricht, Netherlands.

Abstract

Background: The assessment of calibration performance of risk prediction models based on regression or more flexible machine learning algorithms receives little attention.

Main text: Herein, we argue that this needs to change immediately because poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making. We summarize how to avoid poor calibration at algorithm development and how to assess calibration at algorithm validation, emphasizing balance between model complexity and the available sample size. At external validation, calibration curves require sufficiently large samples. Algorithm updating should be considered for appropriate support of clinical practice.

Conclusion: Efforts are required to avoid poor calibration when developing prediction models, to evaluate calibration when validating models, and to update models when indicated. The ultimate aim is to optimize the utility of predictive analytics for shared decision-making and patient counseling.

Keywords: Calibration; Heterogeneity; Model performance; Overfitting; Predictive analytics; Risk prediction models.
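
Illustrative sketch (not taken from the article): the abstract refers to assessing calibration at validation and to calibration curves. The Python below shows one common way such checks are computed, under stated assumptions. The function names `observed_expected_ratio` and `grouped_calibration_curve` are hypothetical, the inputs are assumed to be predicted risks in [0, 1] paired with binary outcomes (0 or 1), and grouping by sorted risk is only one of several ways to draw a calibration curve; it is not presented as the authors' recommended method.

```python
from statistics import mean


def observed_expected_ratio(risks, outcomes):
    """Mean calibration: observed number of events divided by the
    expected number (the sum of predicted risks). A value near 1
    suggests risks are, on average, neither over- nor underestimated."""
    expected = sum(risks)
    if expected == 0:
        raise ValueError("sum of predicted risks is zero")
    return sum(outcomes) / expected


def grouped_calibration_curve(risks, outcomes, n_groups=10):
    """Sort individuals by predicted risk, split them into roughly
    equal-sized groups, and return one tuple per group:
    (mean predicted risk, observed event rate, group size)."""
    if len(risks) != len(outcomes):
        raise ValueError("risks and outcomes must have the same length")
    pairs = sorted(zip(risks, outcomes))
    n = len(pairs)
    n_groups = min(n_groups, n)  # never create empty groups
    points = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        points.append((
            mean(r for r, _ in chunk),  # mean predicted risk
            mean(y for _, y in chunk),  # observed event rate
            len(chunk),
        ))
    return points


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_risks = [0.1, 0.2, 0.3, 0.4]
    toy_outcomes = [0, 0, 1, 1]
    print(observed_expected_ratio(toy_risks, toy_outcomes))
    print(grouped_calibration_curve(toy_risks, toy_outcomes, n_groups=2))
```

With few individuals per group the observed event rates are noisy, which is consistent with the abstract's point that calibration curves require sufficiently large samples.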

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Adult
Aged
Algorithms
Calibration / standards*
Humans
Machine Learning / standards*
Male
Middle Aged
Predictive Value of Tests*

Grants and funding

27294/CRUK_/Cancer Research UK/United Kingdom