# Does poor methodological quality of prediction modeling studies translate to poor model performance? An illustration in traumatic brain injury – Diagnostic and Prognostic Research

May 5, 2022

### Systematic search

We used data from a recent systematic review of multivariable prediction models, published between 2006 and 2018, that were based on admission characteristics (first 24 h after injury) for patients with moderate or severe TBI (Glasgow Coma Scale ≤ 12) [6] (Supplementary Tables 1 and 2). The protocol of this systematic review has been registered on PROSPERO (registration number 2016: CRD42016052100). Studies were eligible for inclusion if they reported on the development, validation, or extension of multivariable prognostic models for functional outcome in patients aged ≥ 14 years with moderate and severe TBI. There were no limitations concerning outcome measurement, provided that functional outcome was measured between 14 days and 24 months after injury.

We updated the systematic search for 2019–2021 (December 2018–June 2021). One investigator (IRH) independently screened records for possibly relevant studies based on title and abstract. Subsequently, full texts of potentially relevant articles were assessed for eligibility. In case of doubt, a second investigator (AM) was consulted.

### Study selection

We selected externally validated prediction models for moderate and severe TBI (Supplementary Table 1) as previously identified by Dijkland et al. (2019) or identified through the updated search. To be included, the model development study had to report model performance in terms of discriminative ability. The external validation could be described in the same publication that described model development, or in a separate publication.

### Data extraction

One investigator (IRH) extracted data from the included studies. A check for all included studies was performed by a second investigator (AM). For the development studies, the data extraction form was based on the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies (CHARMS) checklist [8], and included the source of data, participants, outcome, sample size, predictors, missing data, model development, performance measures, and presentation. For the validation studies, data were extracted on the study design, setting, inclusion criteria, sample size, and model performance. To ensure consistency of the data extraction, the form was tested on two studies by both investigators.

If one publication reported on multiple prediction models, data extraction was performed separately for each model. Prediction models were classified as separate if they included a different set of predictors (e.g., IMPACT core and IMPACT extended [9]). Models with an identical set of predictors, but for different outcomes (e.g., mortality and unfavorable outcome), were not classified as separate models.

### Risk of bias and applicability

Risk of bias and applicability of included development studies were assessed with the Prediction model Risk Of Bias Assessment Tool (PROBAST) [7]. Judgments on high, low, or unclear risk of bias for the model development studies were made for five key domains (participant selection, predictors, outcome, sample size and participant flow, and analysis) using 20 signaling questions (Supplementary Table 3). We also used a short form based on the PROBAST including 8/20 signaling questions, which was recently proposed and validated, and showed high sensitivity (98%) and perfect specificity to identify high risk of bias (RoB) [10].

To determine whether there was a reasonable number of outcome events for logistic regression (PROBAST item 4.1), the number of events in the smaller of the two outcome groups (patients with versus without the outcome) was divided by the total degrees of freedom used during the whole modeling process. The total degrees of freedom were based on the number of variables (for continuous variables) or the number of categories (for categorical variables) in the model; this ratio is henceforth referred to as Events Per Parameter (EPP). All candidate predictors were considered part of the modeling process, including those not selected for the multivariable model based on univariable regression analysis or selection procedures. We assumed a reasonable number of outcome events when EPP ≥ 10.
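As a minimal sketch of this check (the patient counts and degrees of freedom below are hypothetical, not taken from any included study):

```python
def events_per_parameter(n_with_outcome, n_without_outcome, total_df):
    """EPP: events in the smaller outcome group divided by the total
    degrees of freedom used anywhere in the modeling process."""
    events = min(n_with_outcome, n_without_outcome)
    return events / total_df

def reasonable_sample_size(epp, threshold=10):
    # PROBAST item 4.1: a reasonable number of outcome events when EPP >= 10
    return epp >= threshold

# Hypothetical study: 120 patients with the outcome, 380 without, and
# 11 total degrees of freedom (continuous predictors count one each;
# categorical predictors count one per category considered).
epp = events_per_parameter(120, 380, 11)
print(round(epp, 1), reasonable_sample_size(epp))
```

Note that the denominator covers every candidate predictor considered during modeling, not only those retained in the final model.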

Concerns regarding the applicability of an included study to the review question can arise when the population, predictors, or outcomes of the included study differ from those specified in the review question [7]. Applicability was judged based on three key domains (participant selection, predictors, and outcome).

Two reviewers (IRH and AM) independently completed the PROBAST checklist (Supplementary Table 3). A third independent reviewer (LW) scored two of the model development studies (17%). Discrepancies between reviewers were resolved through discussion or by consultation with a senior member (DvK) of the review team. The RoB, applicability, and usability were reported per study: we presented one assessment for models described in the same publication but with different sets of predictors (e.g., IMPACT core and IMPACT extended), and likewise for models with an identical set of predictors but different outcomes (e.g., mortality and unfavorable outcome). An overall judgment about risk of bias and applicability of the prediction model study was reached based on a summative rating across all domains according to the PROBAST criteria (low, high, or unclear).

### Usability

A model’s usability in research and clinical practice was rated for its presentation with sufficient detail to be used in the intended context and target population. The model was deemed usable in research if the full model equation or sufficient information to extract the baseline risk (intercept) and individual predictor effects was reported, and usable in clinical practice if an alternative presentation of the model was included (e.g., a nomogram, score chart, or web calculator).

### Relatedness

For validation studies, we assessed the similarity between the derivation population and the validation population for each study, which we refer to as “relatedness.” To judge relatedness, we created a rubric aiming to capture various levels of relatedness by dividing the validation studies into three categories: “related,” “moderately related,” and “distantly related” [6] (Supplementary Table 4). The rubric contained three domains: (I) setting (Intensive Care Unit, Emergency Department, Ward; Country; Not specified), (II) inclusion criteria, and (III) outcome assessment and timing. Studies that did not meet the domain about setting were judged “moderately related,” whereas studies that did not meet the domains about inclusion criteria and/or outcome assessment and timing were judged “distantly related.”
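The rubric's decision logic can be sketched as follows (the boolean inputs are an illustrative simplification of the three-domain scoring form, not the review's actual instrument):

```python
def relatedness(setting_met, inclusion_met, outcome_met):
    """Classify a validation study against the three-domain rubric:
    failing inclusion criteria and/or outcome assessment and timing
    -> 'distantly related'; failing only the setting domain
    -> 'moderately related'; meeting all domains -> 'related'."""
    if not (inclusion_met and outcome_met):
        return "distantly related"
    if not setting_met:
        return "moderately related"
    return "related"

print(relatedness(True, True, True))    # related
print(relatedness(False, True, True))   # moderately related
print(relatedness(True, False, True))   # distantly related
```

Failing the inclusion-criteria or outcome domain dominates the classification, so a study that misses both setting and inclusion criteria is still "distantly related."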

### Model performance

Model performance was summarized in terms of discrimination and calibration. In prior studies, discrimination was assessed with the c statistic, or area under the receiver operating characteristic curve (AUC), which ranges from 0.50 (no discrimination) to 1.0 (perfect discrimination). Calibration was typically assessed with the calibration intercept a, which indicates whether predictions are systematically too low or too high and should ideally be 0, and with the calibration slope b, which indicates whether the overall prognostic effect of the linear predictor of the developed model is over- or underestimated and should ideally be 1.
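To make the discrimination measure concrete, a minimal pure-Python sketch of the c statistic: the probability that a randomly chosen patient with the outcome receives a higher predicted risk than a randomly chosen patient without it, with ties counted as half (the risks and outcomes below are invented for illustration):

```python
def c_statistic(risks, outcomes):
    """Concordance (c) statistic / AUC: the fraction of all
    event/non-event pairs in which the patient with the event has
    the higher predicted risk; tied risks contribute 0.5."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(risks, outcomes) if y == 0]
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))

# Hypothetical predicted risks of unfavorable outcome and observed outcomes:
risks = [0.9, 0.7, 0.6, 0.4, 0.2]
outcomes = [1, 1, 0, 1, 0]
print(c_statistic(risks, outcomes))  # 5 of 6 pairs concordant
```

A value of 0.5 corresponds to predictions no better than chance; 1.0 means every event patient outranks every non-event patient.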

### Relation between methodological quality and model performance

To quantify the relation between methodological quality at development and model performance at external validation, we first calculated the change in discriminative performance between the derivation cohort and the validation cohort. The percent change in discrimination was calculated as follows:

$$\%\ \mathrm{change\ in\ discrimination}=\frac{\left(\mathrm{validation\ AUC}-0.5\right)-\left(\mathrm{derivation\ AUC}-0.5\right)}{\mathrm{derivation\ AUC}-0.5}\times 100$$

For instance, when the AUC decreases from 0.70 at derivation to 0.60 at validation, the discrimination above chance drops from 0.20 to 0.10, so this decrease of 0.10 points represents a 50% loss in discriminative ability (since an AUC of 0.50 represents no discrimination). We calculated the median and interquartile range (IQR) of the change in discrimination for low, high, and unclear RoB models.
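The formula and the worked example above can be written as a short sketch:

```python
def pct_change_in_discrimination(derivation_auc, validation_auc):
    """Percent change in discriminative ability, measured relative
    to the discrimination above chance (AUC - 0.5) at derivation."""
    return ((validation_auc - 0.5) - (derivation_auc - 0.5)) \
        / (derivation_auc - 0.5) * 100

# AUC falls from 0.70 at derivation to 0.60 at validation:
# (0.10 - 0.20) / 0.20 * 100 = -50, i.e., half of the discriminative
# ability above chance is lost.
print(pct_change_in_discrimination(0.70, 0.60))
```

Because the denominator is the derivation AUC minus 0.5, the same absolute drop in AUC counts as a larger relative loss for models that started with weaker discrimination.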

We used generalized estimating equations (GEE) to estimate the effect of the RoB classification (Low; High; Unclear RoB based on the original PROBAST) on the observed change in discrimination, taking into account the correlation between validations of the same model and the similarity in study design between the development and validation study (Similar; Cohort to trial; Trial to cohort).

### Evidence synthesis

A synthesis was provided for the included development and external validation studies. Extracted data, RoB, applicability, and usability were presented in summary tables and where appropriate in graphical representations. Figures were constructed with R software version 3.6 (R Foundation for Statistical Computing, Vienna, Austria).
