Predicting Health Checkup Participants’ Mortality Risk Using Machine Learning-Based Models: The J-SHC Study


This study was conducted as part of an ongoing study on the design of a comprehensive medical system for chronic kidney disease (CKD) based on individual risk assessment by specific health check-ups (J-SHC). In Japan, a specific health check-up is carried out every year for all residents aged 40 to 74 years who are covered by national health insurance. In this study, a baseline survey was conducted among 685,889 people (42.7% male, aged 40-74 years) who participated in specific health check-ups from 2008 to 2014 in eight regions (Yamagata, Fukushima, Niigata, Ibaraki, Toyonaka, Fukuoka, Miyazaki, and Okinawa prefectures). Details of this study have been described elsewhere [11]. Of the 685,889 baseline participants, 169,910 were excluded because baseline data on lifestyle information or blood tests were not available. In addition, 399,230 participants with survival follow-up of less than 5 years from the baseline survey were excluded. Therefore, 116,749 participants (42.4% male) with known 5-year survival or mortality status were included in this study.

This study was conducted in accordance with the guidelines of the Declaration of Helsinki and was approved by the Yamagata University Ethics Committee (Approval No. 2008–103). All data were anonymized prior to analysis; therefore, the Yamagata University Ethics Committee waived the requirement for informed consent from study participants.


For the validation of a predictive model, the most desirable approach is a prospective study on unseen data. In this study, the dates of the health check-ups were available. Therefore, we split the data into training and test datasets by check-up date to build and test the predictive models. The training dataset included 85,361 participants who participated in the study in 2008. The test dataset included 31,388 participants who participated from 2009 to 2014. These datasets were separated in time, and no participant appeared in both. This approach evaluates the model in a manner similar to a prospective study and has the advantage of demonstrating temporal generalizability. As preprocessing, 0.01% outliers were clipped and the variables were normalized.
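The temporal split and preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, "0.01% outliers" is assumed to mean clipping each variable at its 0.01% and 99.99% quantiles, normalization is assumed to be z-scoring, and the bounds and statistics are assumed to be learned from the training data only.

```python
import numpy as np

def temporal_split(years, train_year=2008):
    """Boolean masks: baseline year == train_year (training), later years (test)."""
    years = np.asarray(years)
    return years == train_year, years > train_year

def fit_clip_and_scale(X_train, tail=0.0001):
    """Learn clipping bounds and z-score statistics from the training data only.

    tail=0.0001 is an assumed reading of "0.01% outliers" (each tail).
    """
    lo = np.nanquantile(X_train, tail, axis=0)
    hi = np.nanquantile(X_train, 1.0 - tail, axis=0)
    clipped = np.clip(X_train, lo, hi)
    mean = np.nanmean(clipped, axis=0)
    std = np.nanstd(clipped, axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return lo, hi, mean, std

def apply_clip_and_scale(X, lo, hi, mean, std):
    """Clip to the learned bounds, then normalize with the learned statistics."""
    return (np.clip(X, lo, hi) - mean) / std

# Synthetic stand-in data; the real study data are not reproduced here.
rng = np.random.default_rng(0)
years = rng.integers(2008, 2015, size=1000)
X = rng.normal(size=(1000, 3))

train_mask, test_mask = temporal_split(years)
params = fit_clip_and_scale(X[train_mask])
X_train = apply_clip_and_scale(X[train_mask], *params)
X_test = apply_clip_and_scale(X[test_mask], *params)
```

Fitting the bounds on the training year alone keeps information from later check-ups out of the preprocessing, which matches the prospective-style evaluation the text describes.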

Information on 38 variables was obtained during the health check-up baseline survey. When variables were strongly correlated (correlation coefficient greater than 0.75), only one of them was included in the analysis. High correlations were found among body weight, abdominal circumference, and body mass index; between hemoglobin A1c (HbA1c) and fasting blood glucose; and between aspartate aminotransferase (AST) and alanine aminotransferase (ALT) levels. From these groups, we used body weight, HbA1c level, and AST level as explanatory variables. Finally, we used the following 34 variables to build the prediction models: age, sex, height, weight, systolic blood pressure, diastolic blood pressure, urinary glucose, urinary protein, urinary occult blood, uric acid, triglycerides, high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), AST, γ-glutamyl transpeptidase (γGTP), estimated glomerular filtration rate (eGFR), HbA1c, smoking, alcohol consumption, medications (for hypertension, diabetes, and dyslipidaemia), history of stroke, heart disease, and kidney failure, weight gain (more than 10 kg since age 20), exercise (more than 30 min per session, more than 2 days per week), walking (more than 1 h per day), walking speed, eating speed, supper 2 h before bedtime, skipping breakfast, late-night snacks, and sleep status.
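The correlation-based variable selection can be illustrated with a short sketch. The function `select_uncorrelated` and its `preferred` argument are hypothetical; the source states only the 0.75 threshold and which variables were retained, not the procedure used. The sketch assumes Pearson correlation on complete (non-missing) data.

```python
import numpy as np

def select_uncorrelated(X, names, preferred=(), threshold=0.75):
    """Greedily keep variables whose |correlation| with every kept variable
    is at most `threshold`; variables in `preferred` are considered first."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    # Preferred variables first, then the rest in original column order.
    order = sorted(range(len(names)), key=lambda i: (names[i] not in preferred, i))
    kept = []
    for i in order:
        if all(corr[i, j] <= threshold for j in kept):
            kept.append(i)
    return [names[i] for i in sorted(kept)]

# Synthetic example: BMI is built to track body weight closely.
rng = np.random.default_rng(0)
weight = rng.normal(60, 10, 500)
bmi = weight / 2.7 + rng.normal(0, 0.5, 500)
sbp = rng.normal(125, 15, 500)
X = np.column_stack([weight, bmi, sbp])

selected = select_uncorrelated(X, ["weight", "bmi", "sbp"], preferred=("weight",))
```

Passing the clinically preferred variable in `preferred` mirrors the choice reported in the text, where body weight, HbA1c, and AST were kept from their respective correlated groups.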

The values of each item in the training dataset were compared between the alive and dead groups using the chi-square test, Student's t-test, and the Mann–Whitney U test, and significant differences (P …) are reported in Supplementary Tables S1 and S2.
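A group comparison of this kind might look like the sketch below, using SciPy. The helper `compare_groups` is hypothetical, and the source does not say which test was applied to which variable; the assignment here (chi-square for categorical items, t-test or U test for continuous items) is an assumption. The data are synthetic.

```python
import numpy as np
from scipy import stats

def compare_groups(values, died, test):
    """Return the p-value comparing `values` between alive and dead groups.

    test: "chi2" (categorical), "t" (Student's t-test), or "u" (Mann-Whitney U).
    """
    values = np.asarray(values)
    died = np.asarray(died).astype(bool)
    if test == "chi2":
        # Contingency table: one row per category, columns = alive / dead.
        table = np.array([
            [np.sum((values == c) & ~died), np.sum((values == c) & died)]
            for c in np.unique(values)
        ])
        return stats.chi2_contingency(table)[1]
    alive, dead = values[~died], values[died]
    if test == "t":
        return stats.ttest_ind(alive, dead).pvalue
    if test == "u":
        return stats.mannwhitneyu(alive, dead, alternative="two-sided").pvalue
    raise ValueError(f"unknown test: {test}")

# Synthetic stand-in data: 300 alive, 50 dead.
rng = np.random.default_rng(0)
died = np.r_[np.zeros(300, dtype=bool), np.ones(50, dtype=bool)]
age = np.r_[rng.normal(60, 8, 300), rng.normal(70, 8, 50)]
smoker = rng.integers(0, 2, 350)

p_age_t = compare_groups(age, died, "t")
p_age_u = compare_groups(age, died, "u")
p_smoker = compare_groups(smoker, died, "chi2")
```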

Prediction models

We used two machine learning-based methods (gradient boosting decision tree [GBDT; XGBoost] and neural network) and one conventional method (logistic regression) to build the prediction models. All models were built with Python 3.7. We used the XGBoost library for GBDT, TensorFlow for the neural network, and Scikit-learn for logistic regression.

Completion of missing values

The data obtained in this study contained missing values. XGBoost handles missing values natively and can be trained and used for prediction without imputation; the neural network and logistic regression cannot. Therefore, for these models we filled in the missing values using the k-nearest-neighbor method (k = 5), and the test data were filled in using an imputer trained on the training data only.
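A minimal sketch of this imputation step, assuming Scikit-learn's `KNNImputer` was the tool used (the source names only the k-nearest-neighbor method with k = 5). The arrays are toy data; the key point is that the imputer is fitted on the training set and then only applied to the test set.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with missing entries (np.nan); not the study data.
X_train = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
    [5.0, 10.0],
    [6.0, 12.0],
])
X_test = np.array([[3.5, np.nan]])

imputer = KNNImputer(n_neighbors=5)             # k = 5, as in the text
X_train_filled = imputer.fit_transform(X_train)  # fit on training data only
X_test_filled = imputer.transform(X_test)        # reuse the fitted imputer
```

Reusing the fitted imputer means that missing test values are filled from training-set neighbors, so no information flows from the test set into the model inputs.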

Parameter determination

The required parameters for each model were determined on the training data using the RandomizedSearchCV class from the Scikit-learn library, repeating five-fold cross-validation 5000 times.
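The search could be set up roughly as below for the logistic regression model. This is a sketch under assumptions: the search space, the `roc_auc` scoring metric, and the synthetic data are illustrative and not taken from the source, and "5000 times" is read here as `n_iter=5000` (reduced to 20 so the example runs quickly).

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced binary data standing in for the training set.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # assumed search space
    n_iter=20,          # the text reports 5000; reduced for illustration
    cv=5,               # five-fold cross-validation
    scoring="roc_auc",  # assumed tuning metric
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

Each sampled configuration is scored by five-fold cross-validation on the training data alone, so the test dataset plays no part in choosing the parameters.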

Performance evaluation

The performance of each prediction model was assessed by predicting the test dataset, drawing a receiver operating characteristic (ROC) curve, and calculating the area under the curve (AUC). Additionally, accuracy, precision, recall, F1 score (the harmonic mean of precision and recall), and the confusion matrix were calculated for each model. To assess the importance of the explanatory variables for the predictive models, we used SHAP and obtained SHAP values that express the influence of each explanatory variable on the model output [4,12]. The workflow diagram for this study is shown in Fig. 5.
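The listed metrics can be computed with Scikit-learn as in the following sketch. The `evaluate` helper and the 0.5 decision threshold are assumptions (the source does not state the threshold), and the labels and probabilities are toy values.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

def evaluate(y_true, y_prob, threshold=0.5):
    """AUC from predicted probabilities; other metrics from thresholded labels."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "auc": roc_auc_score(y_true, y_prob),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }

# Toy example: 1 = died within 5 years, 0 = alive.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.45, 0.2, 0.7])
metrics = evaluate(y_true, y_prob)
```

AUC is threshold-independent, whereas accuracy, precision, recall, and F1 all depend on the chosen cut-off, which matters when deaths are a small minority of the test set.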

Fig. 5. Workflow diagram of predictive model development and performance evaluation.

Sherry J. Basler