Machine learning-based derivation and external validation of a tool to predict death and development of organ failure in hospitalized patients with COVID-19

Study design and patient population

The University of Washington (UW) dataset includes demographic and clinical data from COVID-19-positive adult patients (≥18 years of age) who were admitted to two UW hospitals (the Montlake and Harborview campuses) between March 2020 and March 2021. A confirmed case of COVID-19 was defined by a positive reverse transcriptase-polymerase chain reaction (RT-PCR) test result. The Tongji Hospital COVID-19 dataset is publicly available6. Briefly, patients in the Tongji dataset were recruited from January 10 to February 18, 2020, and constituted the external validation cohort for the mortality model. In both the UW and Tongji datasets, mortality prediction models were developed using clinical data collected during the first 24 hours after arrival at the hospital.

Ethical approval and consent to participate

The University of Washington Institutional Review Board (IRB) approved the study protocol (STUDY10159). All clinical investigations were conducted according to the principles expressed in the Declaration of Helsinki. The requirement for written informed consent was waived by the University of Washington IRB because of the retrospective nature of our review of routine clinical data.


The primary endpoint was in-hospital mortality. We developed and internally validated a model for predicting hospital mortality and externally validated it in the Tongji dataset. Secondary endpoints were transfer to the ICU, shock, and receipt of renal replacement therapy (RRT). Because these secondary outcomes were not recorded in the Tongji dataset, prediction models for them were developed and validated using the UW dataset alone. Shock was defined as the new receipt of vasopressor drugs after the first day of hospitalization.

Feature selection

Since the mortality prediction model was developed in the UW dataset and validated externally in the Tongji dataset, we first selected the variables that overlapped between the two datasets. Twenty features overlapped, and these 20 features were used for the mortality prediction model. All clinical and laboratory data were extracted from the medical record during the first day of hospital admission, and patients were included in the analysis for each outcome only if they had not already experienced that outcome on the first day of hospitalization. A separate prediction model was developed for each outcome.

The following steps were followed for feature selection. First, features were removed if more than 10% of their values were missing. Second, features with near-zero variance were removed, as these take essentially the same value for every patient. Third, pairwise correlations between all features were calculated; if two features had a correlation greater than 0.8, the one with the higher mean absolute correlation was dropped. Fourth, missing values were imputed with the mode for categorical variables and the median otherwise. Finally, all continuous variables were standardized.
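The five steps above can be sketched as follows. This is an illustrative pandas/NumPy implementation under stated assumptions (the paper does not name its software); the function name `select_features` and the near-zero-variance cutoff are our own, while the 10% missingness and 0.8 correlation thresholds come from the text.

```python
import numpy as np
import pandas as pd

def select_features(X: pd.DataFrame, max_missing=0.10, corr_cutoff=0.8):
    X = X.copy()
    # 1. Drop features with > 10% missing values.
    X = X.loc[:, X.isna().mean() <= max_missing]
    # 2. Drop near-zero-variance features (numeric columns only).
    num = X.select_dtypes(include=np.number)
    X = X.drop(columns=num.columns[num.var() < 1e-8])
    # 3. Of each pair correlated above the cutoff, drop the feature
    #    with the higher mean absolute correlation.
    num = X.select_dtypes(include=np.number)
    corr = num.corr().abs()
    mean_corr = corr.mean()
    drop = set()
    cols = list(corr.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            a, b = cols[i], cols[j]
            if a in drop or b in drop:
                continue
            if corr.loc[a, b] > corr_cutoff:
                drop.add(a if mean_corr[a] >= mean_corr[b] else b)
    X = X.drop(columns=list(drop))
    # 4. Impute: mode for categorical columns, median for numeric ones.
    for c in X.columns:
        fill = X[c].mode().iloc[0] if X[c].dtype == object else X[c].median()
        X[c] = X[c].fillna(fill)
    # 5. Standardize continuous variables to zero mean and unit variance.
    num_cols = X.select_dtypes(include=np.number).columns
    X[num_cols] = (X[num_cols] - X[num_cols].mean()) / X[num_cols].std()
    return X
```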

Data partitioning, UW dataset

We randomly divided the UW dataset into development and internal validation sets by stratified sampling. The training set included 475 patients and the internal validation set included 237 patients. We first trained models on the training set and then selected the best model based on its performance on the internal validation set. The best hospital mortality models were then tested in the external validation set. For the three prediction models for transfer to the ICU, shock, and RRT, we performed cross-validation in the UW dataset as follows: (1) patients were randomly divided into ten folds in a stratified manner using the outcome variable; (2) the model was trained on nine of the ten folds and tested on the remaining fold. The procedure was repeated ten times so that each fold was used exactly once as the test fold.
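The partitioning scheme can be sketched with scikit-learn, assuming the 475/237 split sizes from the text; the synthetic data below are placeholders, not the study data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(712, 20))      # 712 patients, 20 overlapping features
y = rng.integers(0, 2, size=712)    # binary outcome label (placeholder)

# Stratified development / internal-validation split (475 vs. 237 patients).
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=237, stratify=y, random_state=0)

# Ten stratified folds: train on nine folds, test on the held-out fold,
# so each fold serves exactly once as the test fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    pass  # fit on X[train_idx], evaluate on X[test_idx]
```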

Machine learning models

Least Absolute Shrinkage and Selection Operator (LASSO) logistic regression is a logistic regression approach with an L1 penalty20. The L1 penalty term encourages parsimony, thus preventing overfitting and producing a sparse model. A weighted LASSO logistic regression was used to handle the imbalanced data. The lambda hyperparameter was selected by tenfold stratified cross-validation.
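A minimal scikit-learn sketch of this step, assuming synthetic placeholder data; scikit-learn parameterizes the penalty as C = 1/lambda, and `class_weight="balanced"` stands in for the weighting described below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

# Weighted L1-penalized logistic regression; the penalty strength
# (C = 1/lambda) is chosen by tenfold stratified cross-validation.
lasso = LogisticRegressionCV(
    penalty="l1", solver="liblinear",
    class_weight="balanced",           # sample weighting for imbalance
    cv=StratifiedKFold(n_splits=10),   # tenfold stratified CV over lambda
    Cs=10, max_iter=1000)
lasso.fit(X, y)
# The L1 penalty zeroes out uninformative coefficients, yielding sparsity.
```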

Elastic net logistic regression (LR) is an approach that combines LASSO LR and ridge logistic regression, incorporating both L1 and L2 penalties21. It can generate sparse models that outperform LASSO logistic regression when highly correlated predictors are present. The alpha and lambda hyperparameters were selected by tenfold stratified cross-validation.
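A sketch of the elastic net variant under the same assumptions; scikit-learn's `l1_ratio` plays the role of alpha (the L1/L2 mix) and C = 1/lambda, with both tuned by tenfold stratified cross-validation. The candidate grids are illustrative, not the study's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

# Elastic net = combined L1/L2 penalty; tune the mixing parameter
# (l1_ratio ~ alpha) and penalty strength (C = 1/lambda) jointly.
enet = GridSearchCV(
    LogisticRegression(penalty="elasticnet", solver="saga",
                       class_weight="balanced", max_iter=5000),
    param_grid={"l1_ratio": [0.1, 0.5, 0.9], "C": [0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=10), scoring="roc_auc")
enet.fit(X, y)
```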

Extreme Gradient Boosting (XGBoost). XGBoost is a gradient boosting machine (GBM) based on decision trees, which separate patients with and without the outcome of interest using simple yes/no splits22. A GBM builds trees sequentially, with each new tree attempting to improve the model fit by weighting hard-to-predict patients more heavily. The following hyperparameter settings were applied: nrounds = 150, eta = 0.2, colsample_bytree = 0.9, gamma = 1, subsample = 0.9, and max_depth = 4. We also used grid search to select the optimal hyperparameters for XGBoost on the training set. The hyperparameter candidates were exhaustively generated from the number of boosting rounds (nrounds) = {150, 250, 350}, eta = {0.1, 0.2, 0.3}, colsample_bytree = {0.5, 0.7, 0.9}, gamma = {0.5, 1}, and max_depth = {4, 8, 12}. We used fivefold stratified cross-validation to select the hyperparameters that maximized mean AUC for the mortality prediction model. We then retrained the model with the optimal hyperparameters on the training set and evaluated it on the internal and external validation sets, respectively.

Management of class imbalances

A weighted version of each of the above three methods was used to handle the imbalanced data. For example, if there were 90 positives and 10 negatives, a weight of 10/90 was assigned to each positive sample and a weight of one to each negative sample.
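The weighting rule reduces to scaling each class by the minority count over its own count, so the minority class keeps weight one and the majority class is down-weighted. A small sketch (the helper name `class_weights` is ours):

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by minority_count / class_count, so the
    minority class gets weight 1 and larger classes are down-weighted."""
    counts = Counter(labels)
    minority = min(counts.values())
    return {cls: minority / n for cls, n in counts.items()}

# Worked example from the text: 90 positives (label 1), 10 negatives
# (label 0) -> positives weighted 10/90, negatives weighted 1.
w = class_weights([1] * 90 + [0] * 10)
```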

Probability calibration

Isotonic regression was used to calibrate the probabilities produced by the machine learning models23. The calibration model was fitted on the training samples only. A calibration plot was created to assess agreement between the predictions and the observed outcomes at different percentiles of the predicted values; the 45-degree reference line indicates a perfectly calibrated model. A fitted curve below the reference line indicates that the model overestimates the probability of the outcome, whereas a curve above it reflects underestimation.
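A minimal sketch of this step with scikit-learn's `IsotonicRegression`, assuming synthetic scores in place of real model output: the monotone mapping is fitted on training predictions and labels only, then applied to new predicted probabilities.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(4)
# Raw (miscalibrated) model scores and labels on the training set;
# squaring makes the raw scores systematically overestimate risk.
p_train = rng.uniform(size=500)
y_train = (rng.uniform(size=500) < p_train ** 2).astype(int)

# Fit a monotone non-decreasing map from raw score to calibrated
# probability, using the training samples only.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(p_train, y_train)

# Apply the fitted mapping to new predicted probabilities.
p_new = iso.predict(np.array([0.1, 0.5, 0.9]))
```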

Model comparison

We tested the three machine learning methods (LASSO LR, elastic net LR, and XGBoost) independently to predict each outcome. Model performance was compared using the area under the receiver operating characteristic curve (AUC) with 95% confidence intervals24,25. The best performing models for in-hospital mortality in the internal validation cohort were then carried forward to the external validation cohort. We also performed a pre-specified subgroup analysis of model performance in patients older than 50 years and in patients younger than 50 years. Two-sided p values were reported.
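For illustration, the AUC comparison can be sketched as below with a percentile-bootstrap 95% confidence interval; this is one common way to obtain the interval, not necessarily the method of the cited references, and the data are synthetic placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
y = rng.integers(0, 2, size=300)                      # outcome labels
scores = y * 0.6 + rng.normal(scale=0.5, size=300)    # model scores

# Point estimate of discrimination.
auc = roc_auc_score(y, scores)

# Percentile bootstrap: resample patients with replacement and recompute
# the AUC, then take the 2.5th and 97.5th percentiles as the 95% CI.
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(y), size=len(y))
    if len(np.unique(y[idx])) < 2:
        continue  # a resample must contain both classes
    boot.append(roc_auc_score(y[idx], scores[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
```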

