Steps to Avoid Overuse and Misuse of Machine Learning in Clinical Research

The term “overuse” refers to the unnecessary adoption of advanced AI or ML techniques where alternative, reliable or superior methodologies already exist. In such cases, the use of AI and ML techniques is not necessarily inappropriate or unsound, but the rationale for the research is unclear or contrived: for example, a new technique may be proposed that provides no significant new answers.

Many clinical studies have used ML techniques to achieve respectable or impressive performance, as shown by area under the curve (AUC) values between 0.80 and 0.90, and even >0.90 (Box 1). A high AUC is not necessarily a mark of quality, as the ML model may be overfitted (Fig. 1). When a traditional regression technique is applied and compared with ML algorithms, the more sophisticated ML models often offer only marginal accuracy gains, presenting a questionable trade-off between model complexity and accuracy1,2,8,9,10,11,12. Even very high AUCs are no guarantee of robustness: an AUC of 0.99 obtained on data with a very low overall event rate can still translate into a low positive predictive value, because even a small false-positive fraction applied to the many patients without the event produces a large absolute number of false alarms.
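To make the event-rate caveat concrete, the short sketch below (in Python, with illustrative operating characteristics that are not taken from any of the cited studies) shows how a model with seemingly excellent sensitivity and specificity can still yield a low positive predictive value when the outcome is rare.

```python
# Illustrative sketch (assumed numbers, not from the cited studies): even with
# excellent discrimination, a rare outcome can leave the positive predictive
# value low.
sensitivity = 0.95   # true-positive rate of a hypothetical model
specificity = 0.95   # true-negative rate of the same model
event_rate = 0.001   # 1 event per 1,000 patients (rare outcome)

true_pos = sensitivity * event_rate
false_pos = (1 - specificity) * (1 - event_rate)
ppv = true_pos / (true_pos + false_pos)

print(f"Positive predictive value: {ppv:.3f}")  # ~0.019
```

Under these assumed values, roughly 98% of positive predictions would be false alarms, despite the apparently excellent test characteristics.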

Fig. 1: Model fitting.

Given a dataset with data points (green dots) and an actual effect (black line), a statistical model aims to estimate the actual effect. The red line illustrates a close estimate, while the blue line illustrates an overfitted ML model that relies excessively on outliers. Such a model may appear to perform very well on that particular dataset, but may not perform well on a different (external) dataset.
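The pattern in Fig. 1 can be reproduced in a few lines of code. The sketch below uses simulated data and assumed polynomial degrees, not any real clinical dataset: it fits a simple and a highly flexible model to the same points and compares their apparent error with their error on new data.

```python
# Minimal sketch of the overfitting pattern in Fig. 1, using synthetic data;
# the true effect, noise level and polynomial degrees are assumptions chosen
# only to illustrate the point.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30).reshape(-1, 1)
y = 2 * x.ravel() + rng.normal(0, 0.3, 30)       # linear "actual effect" plus noise

x_new = rng.uniform(0, 1, 1000).reshape(-1, 1)   # stand-in for an external dataset
y_new = 2 * x_new.ravel() + rng.normal(0, 0.3, 1000)

for degree in (1, 15):                           # close estimate vs overly flexible model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
    print(degree,
          round(mean_squared_error(y, model.predict(x)), 3),           # apparent (in-sample) error
          round(mean_squared_error(y_new, model.predict(x_new)), 3))   # error on new data
```

Typically the flexible model shows a lower apparent error but a larger error on the new data, which is exactly the gap that external validation is meant to expose.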

There is an important distinction between statistically significant and clinically significant improvement in model performance. Machine learning techniques undoubtedly offer powerful means for dealing with prediction problems involving data with high-dimensional, nonlinear or complex relationships (Table 1). In contrast, many simple medical prediction problems are inherently linear, with features that are chosen because they are known to be good predictors, usually based on previous research or mechanistic considerations. In these cases, ML methods are unlikely to provide substantial improvement in discrimination2. Unlike in the engineering context, where any improvement in performance can improve the system as a whole, modest improvements in the accuracy of medical predictions are unlikely to result in a difference in clinical action.

Table 1 Definitions of several key machine learning terms
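As a rough illustration of the point above, the simulation below uses entirely synthetic data with a purely linear signal and a handful of informative predictors (assumptions chosen for the example) and compares the discrimination of a logistic regression with that of a gradient-boosted model; on problems of this kind the gap is typically negligible.

```python
# Hedged sketch: when the outcome truly depends linearly on a few well-chosen
# predictors, a flexible ML model rarely improves discrimination over logistic
# regression. All data here are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))                          # four known predictors
logit = 0.8 * X[:, 0] + 0.5 * X[:, 1] - 0.6 * X[:, 2]   # purely linear signal
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

lr = LogisticRegression().fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print("logistic regression AUC:", roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1]))
print("gradient boosting AUC:  ", roc_auc_score(y_te, gb.predict_proba(X_te)[:, 1]))
```

Even when the more complex model edges ahead numerically, whether that residual difference would change any clinical action is exactly the question raised above.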

ML techniques should be evaluated against traditional statistical methodologies before being deployed. If the goal of a study is to develop a predictive model, ML algorithms should be compared against a predefined set of traditional regression techniques in terms of the Brier score (an evaluation metric similar to the mean squared error, used to assess the quality of a predicted probability), discrimination (or AUC) and calibration. The model must then be validated externally. The analytical methods and the performance measures on which they are compared must be specified in a prospective study protocol and must go beyond overall performance, discrimination and calibration to also include measures related to overfitting.
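A minimal sketch of such a comparison is given below, assuming a binary outcome and scikit-learn-style models; the cohort variables (X_dev, y_dev, X_ext, y_ext) and the choice of a random forest as the ML comparator are placeholders, not a prescription.

```python
# Sketch of comparing a candidate ML model against a prespecified regression
# baseline on Brier score, discrimination (AUC) and calibration, evaluated on
# held-out data that stands in for external validation.
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.calibration import calibration_curve

def evaluate(model, X_val, y_val):
    """Report Brier score, discrimination and calibration on held-out data."""
    p = model.predict_proba(X_val)[:, 1]
    frac_observed, mean_predicted = calibration_curve(y_val, p, n_bins=10)
    return {
        "brier": brier_score_loss(y_val, p),                        # lower is better
        "auc": roc_auc_score(y_val, p),                             # discrimination
        "calibration": list(zip(mean_predicted, frac_observed)),    # predicted vs observed risk per bin
    }

# Hypothetical usage (X_dev, y_dev: development cohort; X_ext, y_ext: external cohort):
# lr = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)        # prespecified regression baseline
# rf = RandomForestClassifier(random_state=0).fit(X_dev, y_dev)   # candidate ML model
# print(evaluate(lr, X_ext, y_ext))
# print(evaluate(rf, X_ext, y_ext))
```

Calibration deserves as much attention as the AUC: a model that discriminates well but systematically over- or underestimates risk can still mislead clinical decisions.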

Conversely, some algorithms are able to say “I don’t know” when faced with unknown data13, an important but often underestimated ability, because knowing that a prediction is highly uncertain can, in itself, be clinically actionable.
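One very simple way to build in this behaviour (a sketch only, not the approach of reference 13) is to let a probabilistic classifier abstain whenever its predicted probability falls inside an uncertainty band; the band limits below are assumed, tunable thresholds.

```python
# Sketch of a classifier wrapper that abstains on uncertain cases; the 0.4-0.6
# band is an assumed threshold, to be tuned for the clinical setting.
import numpy as np

def predict_or_abstain(model, X, lower=0.4, upper=0.6):
    """Return 1/0 predictions, or None where the model is too uncertain to commit."""
    p = model.predict_proba(X)[:, 1]
    labels = np.where(p >= 0.5, 1, 0).astype(object)
    labels[(p > lower) & (p < upper)] = None   # "I don't know": defer to clinical judgement
    return labels
```

More principled approaches, such as conformal prediction, attach formal coverage guarantees to such abstentions, but even this crude rule makes the model’s uncertainty visible to the clinician.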
