Machine learning model from a Spanish cohort for the prediction of SARS-CoV-2 mortality risk and critical patients

The outbreak of the SARS-CoV-2 pandemic has brought about a disruptive change in society worldwide at all levels. The health problems derived from the infection pose a challenge to the scientific community, as knowledge about the disease is very limited. In this sense, the scientific community has focused its efforts on finding solutions, vaccines and palliatives for the pandemic, trying to speed up the return to normality [1].

The rapid evolution of the pandemic, coupled with the unknown clinical features of the disease, has posed a challenge for the healthcare field. The pandemic has generated problems related to the use of hospital resources, the unexpected evolution of patients and the choice of the most appropriate treatment, taking into account the clinical state that patients already had before the illness [2].

The increasing availability of health data allows the application of big data analysis techniques and artificial intelligence (AI) [3,4]. Various studies in the state-of-the-art literature [5] present their advantages and applicability in different areas, such as decision support systems to improve resource allocation in health management [6] or clinical and prognostic models for the prediction of various diseases such as cancer [7] or heart disease [8,9]. The benefits of these techniques are also indirectly reflected in the increase in scientific publications on the subject [10], which report benefits such as helping to provide better care and reducing costs [11]. These results show the success of these techniques in the field of health: they can discover relevant clinical information hidden in large amounts of data regardless of the format [12,13,14] (image, text or raw data), which plays a key role in clinical decision-making. Specifically, AI techniques allow us to automate processes and quickly analyze results as long as enough data is available. This is essential for converting data into information that allows us to respond quickly to critical situations such as the SARS-CoV-2 virus. Moreover, with the appearance of new strains [15] such as Alpha (UK, Sep 2020), Beta (South Africa, May 2020), Gamma (Brazil, Nov 2020), Delta (India, Oct 2020), the more recent Omicron (multiple countries, Nov 2021), or others still to come whose effects may vary, it is essential to be able to train specific models for specific diseases as soon as the data are available.

However, some studies [16,17,18] are based on statistical techniques, which have been shown to lose accuracy as the volume of information increases [19,20]. To overcome this issue, AI techniques can analyze the large number of variables present and their impact on critical patients.

Regarding AI techniques, we can find two approaches: Deep Learning (DL) and Machine Learning (ML). Among DL approaches, there are previous works with good results [21,22]. However, DL techniques pose challenges of model explainability. Although there are studies that address this issue using techniques such as SHAP [23,24] or, in image classification models, by visualizing convolutional filters, the interpretability of DL models remains a hot topic [25]. It should be emphasized that one of the main purposes of this article is to provide a clear set of variables that influence patient outcomes. For this reason, we propose an interpretable and explainable ML model. In our ML model, explainability is obtained from the weight assigned to each variable, which allows us to validate the model and extract information about the variables that most influence the evolution of patients.

Regarding ML, a recent systematic review of ML models built to predict disease course or risk of mortality in patients [2] concluded that many of the studies analyzed were conducted using data from Chinese patients only. This carries a risk of bias and may raise questions about the applicability and accuracy of existing ML prediction models in other, potentially different, patient populations. Therefore, the objective of the study presented in this article is to construct and validate an ML model for patients infected with SARS-CoV-2 and to provide information on a cohort of Spanish patients. We believe that separate ML models for patients from different nations are necessary. This would lay the groundwork for further research comparing and validating the course of patients from different nations and taking into consideration particular variables of different races. Such a comparative study is beyond the scope of this article. This need is the main reason why there are more and more studies on patients of different nationalities.

Other studies [26,27,28,29] were conducted in the early months of the pandemic. Thus, the number of samples they cover is low, because in the best case they use data collected over 3 months to build their models. Incorporating more samples allows the population used for training to better approximate a normal (Gaussian) distribution. This allows us to draw more robust conclusions and to capture the variety of cases intrinsic to any population. In this sense, our study is more robust in terms of the number of patients included, since it uses data from patients affected by the infection over approximately 8 months.

One can also find studies based on symptoms [17,29,30,31,32,33,34,35] such as headache, vomiting, fever, shortness of breath, diarrhea and muscle aches, together with other variables such as comorbidities. Symptom variables are normally obtained in primary care and stored as handwritten notes and non-tabulated information. Our approach obtains similar results and does not depend on variables that are usually collected manually. In addition, our model uses structured, high-quality variables in a standard format, which facilitates its integration with hospital information systems.

Moreover, in the recent literature, one can find articles whose authors reduce the number of features by applying feature selection techniques [28,30,31,32,34] or domain knowledge [17]. Although in general terms these techniques improve the accuracy of the algorithms by eliminating noise [36], in cases like SARS-CoV-2, which involve complex and varied cases, it is difficult to determine whether apparent noise is in fact real data that affects the problem under study. Infrequent combinations in the dataset can be treated as anomalies even though they have an effect on the outcome, and discarding them means that information is lost. Since we are facing a new problem where much information is unknown, we follow an agnostic approach in which we use all available comorbidities in order to explore the importance of each one.

In addition, there are studies that group different diseases into broad families [26], such as cancer or respiratory problems, which is problematic. Currently, there are more than 100 different types of cancer. Our hypothesis is therefore that different cancers will interact differently with the respiratory effects caused by SARS-CoV-2. Likewise, we assume that some respiratory diseases will interact with SARS-CoV-2 more severely than others. For this reason, we do not group diseases into families. Instead, we explore them individually to find out their impact on patient outcomes.
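The difference between grouping diseases into families and treating each diagnosis individually can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three patient records are invented, the function names `build_vocabulary`, `encode_individual` and `encode_grouped` are hypothetical, and the family mapping is a simplified stand-in for ICD-9 chapters. It is not the preprocessing used in the study.

```python
# Hypothetical patient records: each holds a set of ICD-9 codes (illustrative only).
patients = [
    {"icd9": {"162.9"}},           # lung cancer, unspecified
    {"icd9": {"174.9"}},           # female breast cancer, unspecified
    {"icd9": {"496", "401.9"}},    # chronic airway obstruction + hypertension
]

# Simplified, assumed mapping from family name to member codes (for contrast only).
families = {
    "neoplasms": {"162.9", "174.9"},
    "respiratory": {"496"},
    "circulatory": {"401.9"},
}


def build_vocabulary(records):
    """Return the sorted list of every distinct ICD-9 code seen in the records."""
    return sorted({code for record in records for code in record["icd9"]})


def encode_individual(record, vocabulary):
    """One binary indicator per individual ICD-9 code (no family grouping)."""
    return [1 if code in record["icd9"] else 0 for code in vocabulary]


def encode_grouped(record, family_map):
    """One binary indicator per disease family (the grouping being avoided)."""
    return [1 if record["icd9"] & codes else 0 for codes in family_map.values()]


vocabulary = build_vocabulary(patients)
individual = [encode_individual(p, vocabulary) for p in patients]
grouped = [encode_grouped(p, families) for p in patients]

print(vocabulary)
print(individual[0], individual[1])  # the two cancer patients get different vectors
print(grouped[0], grouped[1])        # under grouping they become identical
```

With the individual encoding, the two cancer patients keep distinct feature vectors, so a model can assign a separate weight to each diagnosis. Under the grouped encoding they collapse into the same vector and that distinction is no longer available to the model.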

Thus, the main objective of this article is to present an ML model and a case study on a cohort of Spanish patients (n = 5378). The data were collected over approximately 8 months of the pandemic, from February 27, 2020 to November 12, 2020. Our ML model is based on medical records and estimates the probability of death of patients with SARS-CoV-2 from their age, sex and comorbidities recorded in ICD-9 format. One of the main points of our article is that the model accurately predicts the probability that a patient will die during their infection. We also used regularization techniques to avoid overfitting. Additionally, another key advantage of our model is that it allows for hypothetical (what-if) analysis by including comorbidities that may arise during infection. This allows for the early detection of potential critical cases, so that the most severe effects in patients infected with SARS-CoV-2 can be mitigated by taking preventive measures.
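The ideas in this paragraph (a regularized model, weights per variable, and what-if analysis) can be sketched as follows. This section does not name the algorithm used, so the sketch assumes an L2-regularized logistic regression trained by plain gradient descent. The eight training rows, the feature names, and the hyperparameters `l2`, `lr` and `epochs` are all invented for illustration and are not taken from the study cohort.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Predicted probability of death for one feature vector."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def mean_log_loss(weights, bias, X, y, eps=1e-12):
    """Average (unregularized) log loss over a dataset."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(weights, bias, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)


def fit_logistic_l2(X, y, l2=0.01, lr=0.1, epochs=2000):
    """Batch gradient descent on log loss plus an L2 penalty on the weights."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        # The L2 term shrinks weights toward zero; the bias is not penalized.
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Synthetic toy data (NOT from the study): age scaled to [0, 1], sex, two comorbidities.
feature_names = ["age_scaled", "sex_male", "comorbidity_a", "comorbidity_b"]
X = [
    [0.45, 0, 0, 0], [0.52, 1, 0, 0], [0.60, 0, 1, 0], [0.58, 1, 0, 1],
    [0.71, 1, 1, 0], [0.80, 0, 0, 1], [0.85, 1, 1, 1], [0.77, 0, 1, 1],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = died, 0 = survived

weights, bias = fit_logistic_l2(X, y)

# Interpretability: rank variables by the magnitude of their learned weight.
ranked = sorted(zip(feature_names, weights), key=lambda t: abs(t[1]), reverse=True)

# What-if analysis: same patient, with comorbidity_a hypothetically added.
patient = [0.70, 1, 0, 0]
what_if = list(patient)
what_if[2] = 1
p_base = predict_proba(weights, bias, patient)
p_what_if = predict_proba(weights, bias, what_if)

print(ranked)
print(f"baseline risk: {p_base:.3f}, with comorbidity_a: {p_what_if:.3f}")
```

Because the model is linear in its inputs, the what-if probability moves in the direction given by the sign of the added comorbidity's weight, which is what makes this kind of hypothetical analysis easy to read. On a real cohort the inputs would be the individual ICD-9 indicators rather than two placeholder comorbidities, and the regularization strength would be chosen by validation rather than fixed by hand.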

The rest of the article is structured as follows. First, the “Method and Methodology” section describes the methodological steps applied to our case study. These steps can be summarized as follows: (i) the regulations under which the method was applied and the approval by the corresponding ethics committee, (ii) the description of the datasets and their characteristics, (iii) data preprocessing, (iv) the explanation of missing values, (v) training of the ML model, and finally, (vi) interpretation and explainability of the ML model. Then, the “Results” section presents the statistics on the study cohort, the results obtained by the ML algorithms and their optimization, as well as the importance of the features obtained by the model. Next, the “Discussion” section discusses the advantages of our proposal and its limitations with respect to the state-of-the-art studies presented in this “Introduction” section. Finally, the “Conclusion and future work” section summarizes the contribution, results and future challenges.

Sherry J. Basler