1. Comparison of logistic regression with machine learning for multicategory risk prediction

Ovarian tumors. We first compared logistic regression-based methods with machine learning methods for diagnosing ovarian tumors by estimating the risk that a tumor is benign, borderline, primary invasive, or secondary metastatic (1). The machine learning methods were based on support vector machines and kernel logistic regression. We focused on discrimination when evaluating model performance. The logistic regression-based methods performed similarly to the machine learning approaches.

Later, using more data from IOTA, we compared logistic regression-based methods with machine learning methods to estimate the risk that a tumor is benign, borderline, stage I primary invasive, stage II-IV primary invasive, or secondary metastatic (2). The machine learning methods were based on support vector machines, k-nearest neighbors, random forests, naïve Bayes, and nearest shrunken centroids. This time we also evaluated calibration. This analysis again showed that the performance of the logistic regression-based methods was similar to or better than that of the machine learning methods.

Pregnancies of unknown location. We compared logistic regression-based methods with machine learning methods for diagnosing pregnancies of unknown location by estimating the risk that the pregnancy turns out to be failed, ongoing intra-uterine, or ongoing ectopic (3). The machine learning methods were based on neural networks, support vector machines (SVMs), and kernel logistic regression. Again, the performance of the logistic regression-based methods was at least similar to that of the machine learning approaches.

2. Evaluation of discrimination of multicategory risk models

In 2008 we gave an overview of possible discrimination measures for risk models with nominal outcomes (4). We described extensions of the classical AUC or c statistic, followed by suggestions for possible ‘weighted’ multicategory AUCs. In the medical statistical literature, such weighted multicategory AUCs are in fact extensions of the discrimination slope (Yates, Organ Behav Human Perf 1982), also known as the coefficient of discrimination, a possible R-squared metric for binary outcomes (Tjur, Am Stat 2009). The difference in discrimination slope between two models is known as the Integrated Discrimination Improvement or IDI (Pencina et al, Stat Med 2008).
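To make the binary building blocks concrete: the discrimination slope is the mean estimated risk among events minus the mean among non-events, and the IDI is the difference in discrimination slope between two models. A minimal sketch in plain Python (function names are ours, for illustration only):

```python
def discrimination_slope(risks, outcomes):
    """Mean estimated risk among events minus mean among non-events."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    nonevents = [r for r, y in zip(risks, outcomes) if y == 0]
    return sum(events) / len(events) - sum(nonevents) / len(nonevents)

def idi(risks_new, risks_old, outcomes):
    """Integrated Discrimination Improvement: difference in slopes."""
    return (discrimination_slope(risks_new, outcomes)
            - discrimination_slope(risks_old, outcomes))
```

For example, if a new model pushes estimated risks for events up and for non-events down relative to an old model, both slopes can be computed on the same validation data and their difference is the IDI.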

This work evolved into a proposed multicategory c statistic for nominal outcomes called the Polytomous Discrimination Index or PDI (5). We then reviewed methods to evaluate discrimination for nominal outcomes, such as the discrimination plot, nominal c statistics such as the PDI, and post hoc pairwise c statistics (6).
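To sketch the idea behind the PDI: take one patient from each outcome category; for each category, check whether that category's patient has the highest estimated risk for that category among the selected patients; then average over all such sets of patients. A plain-Python sketch under that reading of (5) (ties are ignored here; the paper treats them more carefully):

```python
from itertools import product

def pdi(probs_by_class):
    """probs_by_class[k]: list of probability vectors for patients whose
    true category is k. Returns the average proportion of categories
    correctly 'identified' over all sets of one patient per category."""
    K = len(probs_by_class)
    total = 0.0
    n_sets = 0
    for combo in product(*probs_by_class):
        # combo[k] is the probability vector of the patient from category k
        correct = sum(
            1 for k in range(K)
            if all(combo[k][k] > combo[j][k] for j in range(K) if j != k)
        )
        total += correct / K
        n_sets += 1
    return total / n_sets
```

A PDI of 1 means that, in every set, each category's patient receives the highest estimated risk for its own category; a value near 1/K indicates chance-level discrimination.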

In addition, we presented an overview of existing c statistics for risk models of ordinal outcomes, and based on this proposed a measure called the ordinal c statistic or ORC (7).
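To illustrate the kind of measure involved (this is an illustrative construction, not necessarily the exact ORC of (7), whose definition should be taken from the paper): given a single risk score and ordered outcome categories, one can compute the pairwise c statistic (AUC) for each pair of adjacent categories and average them. A plain-Python sketch (names are ours):

```python
def pairwise_auc(scores_low, scores_high):
    """P(patient from the higher category has the higher score); ties count 1/2."""
    concordant = 0.0
    for a in scores_low:
        for b in scores_high:
            if b > a:
                concordant += 1.0
            elif b == a:
                concordant += 0.5
    return concordant / (len(scores_low) * len(scores_high))

def adjacent_pair_average(scores_by_class):
    """scores_by_class is ordered from lowest to highest outcome category."""
    aucs = [pairwise_auc(lo, hi)
            for lo, hi in zip(scores_by_class, scores_by_class[1:])]
    return sum(aucs) / len(aucs)
```

Averaging only adjacent pairs, rather than all pairs, is one design choice among several discussed in the ordinal-discrimination literature.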

3. Evaluation of multicategory calibration

Calibration assesses whether estimated risks are accurate, i.e., whether an estimated risk of 30% corresponds to an observed event rate of 30% in the intended population. Much has been written about calibration assessment for models with a binary outcome, and we have extended methods for binary outcomes to models for multicategory (nominal) outcomes based on multinomial logistic regression (8). We then proposed a generic framework to derive calibration plots for any multicategory model, so that the model does not have to be based on multinomial logistic regression (2). In the same paper we proposed the Estimated Calibration Index (ECI) to quantify calibration. The ECI is convenient when the performance of different models or algorithms is compared.
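The ECI itself is defined via a flexible spline-based recalibration model in (2); as a simpler hand-rolled illustration of the underlying idea for one outcome category, one can bin patients by estimated risk and compare the mean estimated risk per bin with the observed event rate (all names here are ours):

```python
def binned_calibration(risks, outcomes, n_bins=5):
    """Return (mean estimated risk, observed event rate) per risk bin."""
    pairs = sorted(zip(risks, outcomes))  # sort patients by estimated risk
    size = len(pairs) // n_bins
    table = []
    for i in range(n_bins):
        # last bin absorbs any remainder
        chunk = pairs[i * size:] if i == n_bins - 1 else pairs[i * size:(i + 1) * size]
        mean_risk = sum(r for r, _ in chunk) / len(chunk)
        event_rate = sum(y for _, y in chunk) / len(chunk)
        table.append((mean_risk, event_rate))
    return table
```

For a well-calibrated model the two columns track each other; large bin-wise discrepancies signal miscalibration. Smooth recalibration curves, as used for the ECI, avoid the arbitrariness of the bin boundaries.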

4. Updating of multicategory risk models

Updating an existing risk model for use in a new setting is an attractive approach that avoids developing a new model from scratch. We performed an initial evaluation of the use of such methods for multicategory outcomes (9). The multicategory outcome was modeled through a sequence of binary models (e.g., a model for category A vs B/C, followed by a model for category B vs C), so that binary updating methods could be applied. We are currently working on updating methods for multinomial logistic regression models.
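The simplest binary updating method is intercept recalibration: shift the model's intercept so that the average estimated risk matches the observed event rate in the new setting. A plain-Python sketch that finds the shift by bisection (names are ours, for illustration):

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def intercept_update(linear_predictors, outcomes, lo=-20.0, hi=20.0, tol=1e-10):
    """Find the shift delta such that mean(expit(lp + delta)) equals the
    observed event rate in the new data (mean risk is monotone in delta)."""
    target = sum(outcomes) / len(outcomes)

    def mean_risk(delta):
        return (sum(expit(lp + delta) for lp in linear_predictors)
                / len(linear_predictors))

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_risk(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Applied to each binary model in the sequence, this corrects calibration-in-the-large for the new setting; richer updating methods additionally re-estimate the calibration slope or individual coefficients.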

References

1. Van Calster B, Valentin L, Van Holsbeke C, Testa AC, Bourne T, Van Huffel S, et al. Polytomous diagnosis of ovarian tumors as benign, borderline, primary invasive or metastatic: development and validation of standard and kernel-based risk prediction models. BMC Med Res Methodol. 2010;10:96.

2. Van Hoorde K, Van Huffel S, Timmerman D, Bourne T, Van Calster B. A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J Biomed Inform. 2015;54:283-93.

3. Van Calster B, Condous G, Kirk E, Bourne T, Timmerman D, Van Huffel S. An application of methods for the probabilistic three-class classification of pregnancies of unknown location. Artif Intell Med. 2009;46(2):139-54.

4. Van Calster B, Van Belle V, Condous G, Bourne T, Timmerman D, Van Huffel S. Multi-class AUC metrics and weighted alternatives. 2008. p. 1390-6.

5. Van Calster B, Van Belle V, Vergouwe Y, Timmerman D, Van Huffel S, Steyerberg EW. Extending the c-statistic to nominal polytomous outcomes: the Polytomous Discrimination Index. Stat Med. 2012;31(23):2610-26.

6. Van Calster B, Vergouwe Y, Looman CW, Van Belle V, Timmerman D, Steyerberg EW. Assessing the discriminative ability of risk models for more than two outcome categories. Eur J Epidemiol. 2012;27(10):761-70.

7. Van Calster B, Van Belle V, Vergouwe Y, Steyerberg EW. Discrimination ability of prediction models for ordinal outcomes: relationships between existing measures and a new measure. Biom J. 2012;54(5):674-85.

8. Van Hoorde K, Vergouwe Y, Timmerman D, Van Huffel S, Steyerberg EW, Van Calster B. Assessing calibration of multinomial risk prediction models. Stat Med. 2014;33(15):2585-96.

9. Van Hoorde K, Vergouwe Y, Timmerman D, Van Huffel S, Steyerberg EW, Van Calster B. Simple dichotomous updating methods improved the validity of polytomous prediction models. J Clin Epidemiol. 2013;66(10):1158-65.