# Machine Learning Models to Identify Predictors of Clinical Outcomes with First-Line Immune Checkpoint Inhibitor Therapy in Advanced Non-Small Cell Lung Cancer

### Data source and patient selection

This was a retrospective cohort study of patients with NSCLC starting 1L ICI therapy in the Flatiron Health aNSCLC database. The Flatiron Health EHR-derived database is a longitudinal database comprising anonymized structured and unstructured patient-level data organized through technology-enabled abstraction^{20,21}. During the study period (2015-2021), the Flatiron Health network consisted of approximately 280 US cancer clinics (~800 sites of care). The data are subject to obligations to prevent re-identification and to protect patient confidentiality. The Institutional Review Board of WCG IRB, Puyallup, WA, approved the study protocol for real-world cohort data collection prior to the conduct of the study and waived the requirement for informed consent.

A cohort of patients was selected who were newly diagnosed with NSCLC between January 1, 2015 and November 30, 2020 and who met the following inclusion and exclusion criteria: age ≥ 18 years at NSCLC diagnosis; received one or more ICI-containing 1L regimens within 90 days of NSCLC diagnosis (index date = 1L start date); had ≥ 1 PD-L1 test on or before the index date; had no positive test result for, and received no targeted therapy against, *ALK/EGFR/ROS1/BRAF/KRAS* oncogenic alteration(s); and no participation in a clinical trial.

Two clinical outcomes were assessed. The first was OS, defined as the time elapsed between the index date and death. Structured patient-level data (EHR, obituaries, and the Social Security Death Index) and unstructured EHR (summary) data were already linked to generate a composite mortality variable that has high sensitivity and specificity compared with the National Death Index (NDI)^{22}. The second outcome was PFS, defined as the time from the index date to the first real-world progression event or death. Real-world progression was available based on information already extracted from medical records and was defined as distinct episodes in the patient's journey where the treating physician or clinician concluded that there was spread or worsening of the disease. Flatiron Health uses a clinician-anchored approach supported by radiology reports to assess real-world progression, as this has been reported to be the optimal and most practical method for such assessment^{23}.

### Variable preprocessing

Several categories of candidate variables were considered in our models to predict clinical outcomes, including demographics, medical history, tumor characteristics, comorbidities, metastatic sites, types of 1L therapy, concomitant medications, and laboratory measurements. The assessment time windows, which were determined by clinical experts, varied among the variables, and their details are described below and in Table S1.

Demographic variables that were considered included age at index date (i.e., start of 1L therapy), gender, type of payer, race, and geographic region. Medical history included year of NSCLC diagnosis, smoking status, number of different types of medical visits in the 90 days prior to the index date, and baseline Eastern Cooperative Oncology Group (ECOG) score, defined as the most recent value on or before the index date, or the highest value if multiple ECOG scores were reported on the same day. Tumor characteristics included stage at diagnosis, histology, and PD-L1 expression level, which was assessed based on all valid PD-L1 percent-staining results on or before the index date; the highest percentage of PD-L1 staining was extracted for each patient. Comorbidities, presence of other primary malignancies, and sites of metastasis were assessed based on all ICD-10 and ICD-9 diagnoses documented on or before the index date. Additionally, comorbidities based on ICD-10 and ICD-9 codes were summarized using the Elixhauser Comorbidity Index score and categorized into 29 groups, excluding the 2 groups of metastatic cancer and solid tumor without metastasis^{24}.

Concomitant medications used in the 90 days before or on the index date were grouped according to the third level of the Anatomical Therapeutic Chemical classification (ATC3), and the number of different medications within each class was counted. For example, if a patient was on abacavir (J05AF06), dolutegravir (J05AJ03), and lamivudine (J05AF05), then the ATC3 variable J05A for that patient was 3. ATC3 codes were suppressed if the ATC3 class was taken by 50% of patients in the 90 days before or on the index date. Outliers were defined empirically as the lowest and highest 0.1% of the distribution of each laboratory value assessed, and as the lowest 10% and highest 0.1% of the distribution of each vital-sign value assessed; outlier values were then set to missing.
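The ATC3 grouping and counting rule described above can be sketched in a few lines; this illustrative snippet assumes (as ATC conventions dictate) that the level-3 class is the first 4 characters of a full ATC code:

```python
from collections import defaultdict

def count_atc3(atc_codes):
    """Count distinct medications within each ATC level-3 class.

    The ATC3 class is the first 4 characters of a full ATC code
    (e.g., J05AF06 -> J05A)."""
    classes = defaultdict(set)
    for code in atc_codes:
        classes[code[:4]].add(code)
    return {atc3: len(meds) for atc3, meds in classes.items()}

# The example from the text: three antivirals sharing the J05A class.
counts = count_atc3(["J05AF06", "J05AJ03", "J05AF05"])
```

All three codes collapse to the single class J05A with a count of 3, matching the worked example in the text.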
Missing values were imputed according to the following rules: (1) the mode was imputed for categorical and binary variables; (2) the mean was imputed for most continuous variables; (3) zero was imputed for the PD-L1 level variable, and a binary indicator variable was introduced to flag that the value was missing; (4) zero was imputed for metastatic sites and comorbidities. If multiple values were available during the 90-day window before or on the index date, the number of values, the average value, the variation (i.e., standard deviation), and the direction and magnitude of change (i.e., the slope) were calculated. Categorical integers were used for initial stage at diagnosis, with stage 0/I = 0, stage II = 1, stage IIIA = 2, stage IIIB/C = 3, and stage IV = 4. Other categorical variables were one-hot encoded, and one category of each categorical variable was dropped to minimize collinearity. We further excluded a set of features that had >85% correlation with other features, as measured by the Pearson correlation.
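Two of the preprocessing steps above, mean imputation for continuous variables and dropping highly correlated features, can be sketched with plain numpy (an illustrative implementation, not the study's pipeline; the greedy keep-first rule for correlated columns is an assumption):

```python
import numpy as np

def impute_mean(X):
    """Replace NaNs in each continuous column with the column mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X

def drop_correlated(X, threshold=0.85):
    """Return indices of columns to keep, dropping any column whose absolute
    Pearson correlation with an already-kept column exceeds the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

# Toy matrix: one missing value; column 1 is an exact multiple of column 0.
X = np.array([[1.0, 2.0, 0.5],
              [2.0, np.nan, 1.0],
              [3.0, 6.0, 0.0]])
X_imp = impute_mean(X)
kept = drop_correlated(X_imp)
```

In the toy example the NaN is replaced by the column mean (4.0) and the perfectly correlated second column is dropped, leaving columns 0 and 2.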

### Models

Survival modeling for time-to-event prediction was necessary because of right censoring, i.e., patients leaving the cohort before the onset of the event. Survival modeling requires a data set of the form $D=\{(x_i,\delta_i,t_i)\}_{i=1}^{N}$, where $N$ is the total number of patients in the cohort, $x_i$ represents the characteristics, $\delta_i$ is the event indicator, with $\delta_i=1$ indicating that an event has occurred and $\delta_i=0$ indicating right censoring, and $t_i$ is either the censoring time or the event time for patient $i$^{25}. We used 5 different approaches to perform event-time prediction in the presence of right-censored data. The predicted median survival time from each model was used for evaluation.
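To make the $(x_i,\delta_i,t_i)$ triplet concrete, a minimal Kaplan-Meier estimator shows how right-censored records enter a median survival estimate (an illustrative sketch; the models in this study produce patient-level, not cohort-level, medians):

```python
import numpy as np

def km_median(times, events):
    """Kaplan-Meier estimate of the median survival time from
    right-censored data.

    times  : observed time (event or censoring) per patient, t_i
    events : event indicator delta_i (1 = event, 0 = right-censored)
    Returns the first time at which the survival curve drops to <= 0.5,
    or None if it never does."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events)
    surv = 1.0
    for t in np.unique(times[events == 1]):
        at_risk = np.sum(times >= t)              # still in the cohort at t
        d = np.sum((times == t) & (events == 1))  # events exactly at t
        surv *= 1.0 - d / at_risk
        if surv <= 0.5:
            return t
    return None

# Three events and one censored patient.
median = km_median([1, 2, 3, 4], [1, 1, 1, 0])
```

The censored patient at t = 4 contributes to the risk sets but never to the event counts, which is exactly the asymmetry the five models below must respect.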

The CPH model is a standard semi-parametric approach that estimates the effect of a given set of characteristics on the risk of an event occurring, assuming that each characteristic has a constant multiplicative effect on the hazard over time^{26}. We used penalized Cox regression, tuning the regularization parameter and the method for handling tied event times^{27,28}.
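The penalized objective can be sketched in numpy; this is illustrative only (the study used an existing survival-modeling implementation), with an L2 penalty standing in for the tuned regularizer and the Breslow convention for ties:

```python
import numpy as np

def cox_neg_log_partial_likelihood(beta, X, times, events, l2=0.0):
    """Negative Cox partial log-likelihood (Breslow handling of ties)
    plus an L2 penalty on the coefficients.

    For each observed event, the patient's log-risk is compared against
    the log-sum of risks over everyone still at risk at that time."""
    lp = X @ beta                        # linear predictor per patient
    nll = 0.0
    for i in np.where(events == 1)[0]:
        risk_set = times >= times[i]     # patients still at risk at t_i
        nll -= lp[i] - np.log(np.sum(np.exp(lp[risk_set])))
    return nll + l2 * np.sum(beta ** 2)

# Sanity check: with beta = 0 every patient has equal risk, so the loss is
# the sum over events of the log of the risk-set size: log 3 + log 2 + log 1.
toy_nll = cox_neg_log_partial_likelihood(
    np.zeros(2), np.ones((3, 2)), np.array([1.0, 2.0, 3.0]), np.array([1, 1, 1]))
```

A fitted model minimizes this quantity over beta; the regularization parameter l2 is one of the hyperparameters tuned by cross-validation.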

The accelerated failure time (AFT) model is a parametric model that can be used as an alternative to the CPH model^{29}. Several standard distributions were considered, including the Weibull, log-normal, log-logistic, and exponential. We used quantile-quantile (QQ) plots to examine which distribution fit our 2 outcomes, and chose the log-logistic AFT model, which treats the relationship between the logarithm of survival time and the covariates as linear. The false positive rate and penalty weights were tuned.
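A small sketch of the log-logistic AFT prediction step (illustrative; parameter names `beta` and `sigma` are generic stand-ins for the fitted coefficients and scale): because the model is linear in log time, the predicted median survival time is simply the acceleration factor $\exp(x \cdot \beta)$.

```python
import numpy as np

def loglogistic_survival(t, x, beta, sigma):
    """Survival function of a log-logistic AFT model:
    S(t | x) = 1 / (1 + (t / alpha)**(1 / sigma)), with
    alpha = exp(x . beta), i.e. log T = x . beta + sigma * logistic noise."""
    alpha = np.exp(x @ beta)
    return 1.0 / (1.0 + (t / alpha) ** (1.0 / sigma))

def predict_median(x, beta):
    """The median of a log-logistic distribution is its scale alpha,
    so the predicted median survival time is exp(x . beta)."""
    return np.exp(x @ beta)

x = np.array([1.0, 2.0])
beta = np.array([0.1, 0.2])
median = predict_median(x, beta)
```

Evaluating the survival function at the predicted median returns exactly 0.5, which is the property the evaluation metrics below rely on when comparing predicted median survival times across models.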

The survival support vector machine (SSVM) used in this study handles right-censored survival data by combining a ranking-based loss and a regression-based loss, with computational efficiency improved by using kernel functions^{30}. The penalty weights, the blending parameter between the ranking and regression losses, and the optimizer were tuned.
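The blended objective can be illustrated with a deliberately simplified sketch (kernels, margins, and the exact loss weighting of the published SSVM are omitted; the blend parameter `r` corresponds to the tuned blending parameter):

```python
import numpy as np

def ssvm_hybrid_loss(pred, times, events, r=0.5):
    """Simplified blend of ranking and regression losses for
    right-censored data.

    Ranking part: for each comparable pair (i, j) with t_i < t_j and
    patient i uncensored, apply a squared hinge penalty unless j is
    predicted at least a unit margin later than i.
    Regression part: squared error against observed event times; for
    censored patients, only predictions *below* the censoring time are
    penalized (after censoring, nothing is known)."""
    rank_terms, reg_terms = [], []
    n = len(times)
    for i in range(n):
        if events[i] == 1:
            reg_terms.append((pred[i] - times[i]) ** 2)
            for j in range(n):
                if times[i] < times[j]:
                    rank_terms.append(max(0.0, 1.0 - (pred[j] - pred[i])) ** 2)
        else:
            reg_terms.append(max(0.0, times[i] - pred[i]) ** 2)
    rank = np.mean(rank_terms) if rank_terms else 0.0
    reg = np.mean(reg_terms)
    return r * rank + (1.0 - r) * reg

# Perfect predictions incur zero loss from both components.
loss = ssvm_hybrid_loss(np.array([1.0, 2.0, 3.0]),
                        np.array([1.0, 2.0, 3.0]),
                        np.array([1, 1, 1]))
```

Setting r = 1 recovers a pure ranking objective (concordance-style), while r = 0 recovers a censoring-aware regression, which is the trade-off the tuned blending parameter controls.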

A gradient-boosted decision tree (GBDT) model was used to assess whether nonlinear relationships captured by increased model complexity would improve performance^{31}. We used the Cox loss for the GBDT (GBDT-CPH). The learning rate, number of regression trees, maximum depth of the individual regression estimators, and fraction of samples used to fit the individual regression estimators were tuned.
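In gradient boosting with the Cox loss, each new regression tree is fitted to the negative gradient of the partial likelihood with respect to the current model scores; these pseudo-residuals (the martingale residuals) can be sketched directly (illustrative numpy, not the study's implementation):

```python
import numpy as np

def cox_pseudo_residuals(f, times, events):
    """Negative gradient of the Cox partial-likelihood loss with respect
    to the current model scores f: the pseudo-residuals that each new
    regression tree in GBDT-CPH is fitted to (martingale residuals)."""
    f = np.asarray(f, dtype=float)
    haz = np.exp(f)
    resid = np.zeros_like(f)
    for m in range(len(f)):
        cum = 0.0
        # Accumulate patient m's share of the hazard at every event time
        # at which m was still at risk.
        for i in np.where(events == 1)[0]:
            if times[i] <= times[m]:
                cum += haz[m] / np.sum(haz[times >= times[i]])
        resid[m] = events[m] - cum
    return resid

# With f = 0 and three events, the earliest patient's residual is
# 1 - 1/3, and the residuals sum to zero.
resid = cox_pseudo_residuals(np.zeros(3),
                             np.array([1.0, 2.0, 3.0]),
                             np.array([1, 1, 1]))
```

Each boosting iteration fits a shallow tree to these residuals and adds it, scaled by the learning rate, to the ensemble, which is where the tuned depth, tree count, and subsampling fraction enter.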

DeepSurv is a state-of-the-art deep neural network implementation of the CPH model that can capture increasingly complex relationships between patient characteristics and their risk of failure^{18}. We used modern deep learning techniques to optimize network training, including tuning the hyperparameters of learning rate, dropout rate, number of hidden layers, and number of nodes in each hidden layer.
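Architecturally, DeepSurv replaces the Cox linear predictor with a multilayer perceptron whose scalar output is the log-risk; a minimal numpy forward pass (illustrative only; the real model is trained by backpropagation in a deep learning framework, and ReLU is assumed here as the activation) looks like this:

```python
import numpy as np

def deepsurv_forward(x, weights, biases, dropout_rate=0.0, rng=None):
    """Forward pass of a DeepSurv-style network: an MLP whose single
    output per patient is the log-risk h(x) plugged into the Cox partial
    likelihood during training."""
    a = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        a = a @ W + b
        if k < len(weights) - 1:        # hidden layers only
            a = np.maximum(a, 0.0)      # ReLU activation
            if dropout_rate > 0.0 and rng is not None:
                # Inverted dropout, applied at training time.
                mask = rng.random(a.shape) >= dropout_rate
                a = a * mask / (1.0 - dropout_rate)
    return a  # one log-risk score per patient

# Tiny example: 2 inputs -> 3 hidden units -> 1 risk score.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [np.zeros(3), np.zeros(1)]
risk = deepsurv_forward(np.array([[0.5, -1.0]]), weights, biases)
```

The tuned hyperparameters map directly onto this sketch: the dropout rate, the number of weight matrices (hidden layers), and the width of each hidden layer.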

The hyperparameters of the above models were tuned using random search over candidate parameter settings with five-fold cross-validation, and results were reported on the held-out validation folds from cross-validation, also called the test sets.
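The tuning loop itself is model-agnostic and can be sketched with the standard library alone; `sample_params` and `fit_score` below are hypothetical placeholders for a model's hyperparameter sampler and its fit-then-score routine:

```python
import random

def random_search_cv(sample_params, fit_score, data, n_iter=20, n_folds=5, seed=0):
    """Random hyperparameter search with k-fold cross-validation.

    sample_params : function (rng) -> one random hyperparameter setting
    fit_score     : function (params, train_idx, valid_idx, data) -> score;
                    a real one would fit a survival model on the training
                    fold and score the held-out fold (e.g., by c-index)."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(rng)
        scores = []
        for k in range(n_folds):
            valid_idx = folds[k]
            train_idx = [i for i in indices if i not in set(valid_idx)]
            scores.append(fit_score(params, train_idx, valid_idx, data))
        mean_score = sum(scores) / n_folds
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy demonstration with a score that depends only on the sampled setting.
def sample_params(rng):
    return {"lr": rng.choice([0.01, 0.1, 1.0])}

def fit_score(params, train_idx, valid_idx, data):
    return -abs(params["lr"] - 0.1)

best, score = random_search_cv(sample_params, fit_score, list(range(25)))
```

Each fold in turn serves as the held-out validation set, and the reported results come from those held-out folds rather than from the data used for fitting.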

### Evaluation

Several measures were used to quantify how well the models fit the data. First, we used the concordance index (c-index), which measures how well the model ranks patients by risk score with respect to the clinical outcome of interest^{32}. Second, we used 2 metrics derived from Haider et al.^{33}, the margin loss and the hinge loss; these metrics can handle censored data and quantify model performance in terms of the distance between predicted and observed event times. Third, since most patients had an event within 2 years of the index date, hinge and margin losses for patients with an event within 1 and 2 years after the index date were also reported, to reveal whether a model could accurately predict the time to events.
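Both kinds of metric can be sketched in numpy (illustrative implementations: the c-index here uses the standard pairwise definition, and the hinge loss follows one common censoring-aware formulation rather than Haider et al.'s exact variants):

```python
import numpy as np

def concordance_index(pred_times, times, events):
    """Fraction of comparable patient pairs ordered correctly.

    A pair (i, j) is comparable when patient i had an observed event
    before patient j's observed time; it is concordant when the model
    also predicts an earlier time for i. Prediction ties count 1/2."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if pred_times[i] < pred_times[j]:
                    concordant += 1.0
                elif pred_times[i] == pred_times[j]:
                    concordant += 0.5
    return concordant / comparable

def hinge_loss(pred_times, times, events):
    """Censoring-aware error on predicted survival times: absolute error
    for uncensored patients; for censored patients, a penalty only when
    the prediction falls short of the censoring time."""
    per_patient = np.where(events == 1,
                           np.abs(pred_times - times),
                           np.maximum(0.0, times - pred_times))
    return per_patient.mean()

pred = np.array([1.0, 2.0, 3.0])
times = np.array([1.0, 2.0, 3.0])
events = np.array([1, 1, 0])
```

On this toy cohort the predictions are perfectly ranked (c-index 1.0) and the censored patient incurs no hinge penalty because the prediction does not undershoot the censoring time.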

### Explainability

Understanding how models generate their predictions is important in the clinical setting. We applied 2 different approaches to identify important predictors. First, model-based importance scores were generated by the different models, such as Cox regression coefficients or tree-based feature importance scores from the GBDT. However, a limitation of model-based scores is that some models only indicate whether a characteristic is important without showing the direction of the association, such as whether higher values lead to higher risk. To address this, we used model-agnostic SHAP values^{11}. Specifically, we used KernelSHAP to assign SHAP values to important variables based on the test data.
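The quantity KernelSHAP approximates can be computed exactly for a handful of features by brute force over feature subsets; this sketch (illustrative, not the shap library's implementation) replaces "missing" features with background values, e.g. training means:

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values for one instance, enumerated over feature
    subsets. KernelSHAP estimates this same quantity by weighted linear
    regression when enumeration is infeasible."""
    n = len(x)

    def value(subset):
        # Coalition value: instance values for features in the subset,
        # background values for the rest.
        z = background.copy()
        for i in subset:
            z[i] = x[i]
        return predict(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += w * (value(subset + (i,)) - value(subset))
    return phi

# For a linear model the Shapley value of feature i reduces to
# w_i * (x_i - background_i), which makes the sketch easy to verify.
w = np.array([1.0, -2.0, 0.5])
predict = lambda z: float(z @ w)
x = np.array([2.0, 1.0, 4.0])
background = np.array([1.0, 1.0, 2.0])
phi = shapley_values(predict, x, background)
```

The sign of each value gives exactly the directionality that raw model-based importance scores lack: a positive value pushes the prediction above the background expectation, a negative value pushes it below.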