This retrospective multicenter study was approved by the institutional review boards, and the requirement for written informed consent was waived. A review of clinical databases and the picture archiving and communication system was performed to retrospectively enroll consecutive patients from July 2013 to July 2021 at Center I, from January 2016 to June 2021 at Center II, and from January 2017 to January 2018 at Center III. The center information is provided in Additional file 1 (Section 1). The inclusion criteria were as follows: (1) pathologically confirmed benign or borderline EOTs; (2) MRI performed within 2 weeks before surgery. The exclusion criteria were as follows: (1) any treatment before the MRI examination and biopsy, including chemotherapy or radiotherapy; (2) lack of T2W sequences; (3) poor-quality MR images due to artifacts; (4) tumors that could not be fully displayed because the tumor volume was too small or too large. In total, 417 patients (154 from Center I, 233 from Center II, and 30 from Center III) were enrolled. Patients from Centers I and II were stratified into training and internal validation sets at a ratio of 8:2. Data from Center III were reserved as an external validation set to evaluate the generalizability of the created models to data from a separate institution. The clinical characteristics of all patients, including age, menopausal status, parity, abdominal symptoms, carbohydrate antigen 125 (CA125), and human epididymis protein 4 (HE4), were obtained from electronic medical records. The details of the recruitment process are shown in Fig. 1.
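The stratified 8:2 split described above can be sketched as follows; this is an illustrative example on synthetic labels (the random seed and a binary benign/borderline label coding are assumptions, not the authors' code):

```python
# Hedged sketch: stratified 8:2 split of the Center I + Center II cohort.
# Labels here are synthetic (0 = benign, 1 = borderline); patient IDs are dummies.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_patients = 387                       # Center I (154) + Center II (233)
labels = rng.integers(0, 2, size=n_patients)
ids = np.arange(n_patients)

# stratify=labels keeps the benign/borderline ratio equal across the two sets
train_ids, val_ids, y_train, y_val = train_test_split(
    ids, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(train_ids), len(val_ids))
```

Stratification matters here because borderline EOTs are the minority class; a purely random split could leave the internal validation set with very few borderline cases.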
Image acquisition and tumor segmentation
Patients scanned on various 1.5-T or 3.0-T units with phased-array coils were all included in this study. Fat-suppressed (FS) T2W images were used. The scanners and imaging parameters of the FS T2W sequences are summarized in the Supplementary Materials (Additional file 1: Table S1).
Tumor volumes of interest (VOIs) containing both the cystic and solid components were manually delineated slice by slice on FS T2W images using ITK-SNAP software (v. 3.8.0, http://www.itksnap.org). If a tumor was multifocal, only the lesion with the largest maximum diameter on axial images was segmented. Two examples of VOI segmentation are shown in Fig. 2. Radiologist A, who had 10 years of experience in pelvic MRI diagnosis, first segmented the VOIs for all subjects. To evaluate interobserver reproducibility, the VOIs of 30 patients randomly chosen from the training set were also segmented by another radiologist (Radiologist B), who had 5 years of experience in pelvic MRI diagnosis. To assess intraobserver reproducibility, Radiologist A repeated the segmentation for all patients after one month. The interobserver and intraobserver reproducibility of the VOIs was evaluated with intraclass correlation coefficients (ICCs), and ICCs > 0.80 were considered robust and reproducible. The first segmentation by Radiologist A was used to create the models. Referring to a previous study, the two radiologists who performed VOI delineation also independently assessed the following conventional MRI characteristics: (1) ascites, classified as none, mild (limited to the pouch of Douglas), moderate (limited to the pelvic cavity), or massive (beyond the pelvic cavity); (2) margin, classified as well defined or ill defined; (3) the number of loculi, classified as few (< 3) or multilocular (≥ 3); (4) signal intensity (SI) of the solid component on FS T2W images (compared with the adjacent external myometrium), classified as none, low, high, or mixed; (5) SI of the cystic component on FS T2W images (compared with the urinary bladder), classified as low, moderate, or high; and (6) the maximum diameter. Disagreements between the two radiologists were resolved by consensus review.
Examples of the signal-intensity evaluation are shown in the Supplementary Materials (Additional file 1: Figs. S1 and S2). The two radiologists were blinded to the histopathologic results and clinical information of the tumors when reviewing the MR images.
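The reproducibility analysis above relies on ICCs; the paper does not state which ICC variant was used, so the sketch below implements one common choice, the two-way random-effects single-measure ICC(2,1), from its analysis-of-variance mean squares:

```python
# Hedged sketch of ICC(2,1); the specific ICC model used by the authors is unknown.
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """Two-way random-effects, absolute-agreement, single-measure ICC.
    ratings: (n_subjects, n_raters) matrix of one feature's values."""
    n, k = ratings.shape
    grand = ratings.mean()
    subj_means = ratings.mean(axis=1)
    rater_means = ratings.mean(axis=0)
    ss_rows = k * ((subj_means - grand) ** 2).sum()    # between subjects
    ss_cols = n * ((rater_means - grand) ** 2).sum()   # between raters
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Perfect agreement between two raters yields ICC = 1.0
same = np.tile(np.arange(10.0), (2, 1)).T
print(round(icc2_1(same), 3))  # 1.0
```

In the study's pipeline, one such ICC would be computed per radiomics feature across the two raters (or the two sessions of Radiologist A), and features scoring below 0.80 flagged as non-reproducible.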
Before radiomics processing, normalization was applied to transform arbitrary gray-level intensity values into a standardized intensity range; all T2W images and masks were then isotropically resampled to 3 × 3 × 3 mm³ using B-spline interpolation. A total of 1130 radiomics features were extracted from the VOIs using the PyRadiomics package (http://www.radiomics.io/pyradiomics.html) in Python. Most radiomics features comply with the Image Biomarker Standardization Initiative (IBSI). The custom settings and detailed information on the radiomics features are included in Additional file 1 (Section 2).
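The two preprocessing steps can be illustrated as below. This is a sketch, not the authors' pipeline: the paper used PyRadiomics' built-in normalization and resampling, whereas here scipy is used for transparency (`order=3` selects cubic B-spline interpolation, and the z-score normalization stands in for the unspecified intensity standardization):

```python
# Hedged sketch: intensity normalization + isotropic resampling to 3x3x3 mm.
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume: np.ndarray, spacing_mm: tuple, target_mm: float = 3.0):
    # 1) map arbitrary grey levels to a standardized range (zero mean, unit std)
    norm = (volume - volume.mean()) / (volume.std() + 1e-8)
    # 2) isotropic resampling; order=3 = cubic B-spline interpolation
    factors = [s / target_mm for s in spacing_mm]
    return zoom(norm, factors, order=3)

vol = np.random.rand(20, 30, 30).astype(np.float32)   # dummy volume
out = preprocess(vol, spacing_mm=(4.0, 1.5, 1.5))     # slice gap 4 mm, in-plane 1.5 mm
print(out.shape)
```

Resampling to a common voxel size is what makes texture features comparable across the heterogeneous scanners and protocols of the three centers.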
ComBat harmonization was performed on the radiomics features before model building, as it reduces the bias caused by different scanners (Additional file 1 [Section 3]) [31,32,33]. After ComBat harmonization, the radiomics features were standardized by Z-score normalization (removing the mean and scaling to unit variance). Furthermore, the synthetic minority oversampling technique (SMOTE) was applied in the training set to reduce the bias caused by sample imbalance. Finally, features with ICCs < 0.8 were excluded.
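The study most likely used an off-the-shelf SMOTE implementation (e.g., imbalanced-learn); the minimal sketch below shows the underlying idea, which is to synthesize minority-class samples by interpolating between a minority sample and one of its k nearest minority neighbours:

```python
# Minimal SMOTE sketch (illustrative; not the authors' implementation).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)               # idx[:, 0] is the sample itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))            # pick a minority sample
        j = idx[i, rng.integers(1, k + 1)]      # pick one of its k neighbours
        lam = rng.random()                      # interpolation weight in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)

minority = np.random.default_rng(1).normal(size=(20, 5))  # dummy minority features
new = smote(minority, n_new=30)
print(new.shape)
```

Crucially, oversampling is applied only to the training set, as stated above; applying it before splitting would leak synthetic copies of validation patients into training.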
Radiomics features are high-dimensional; thus, several feature selection steps were applied. First, the Mann–Whitney U test was performed to select features that differed significantly between benign and borderline EOTs in the training set. Second, the importance weight of each feature was calculated with the Random Forest (RF) algorithm, and the correlation coefficient between each pair of features was calculated by Spearman correlation analysis; for any pair with a correlation coefficient > 0.90, the feature with the lower importance weight was removed from the training data. Finally, the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm was used to address multicollinearity by retaining only diagnosis-related features with nonzero coefficients. More information on feature selection is provided in Additional file 1 (Section 4 and Additional file 1: Fig. S3).
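The three-stage selection can be sketched end to end on synthetic data. The thresholds (p < 0.05, |rho| > 0.90) follow the text; the estimator hyperparameters and the use of `LassoCV` on the binary label are assumptions for illustration:

```python
# Hedged sketch of the three feature-selection stages on synthetic features.
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))
y = rng.integers(0, 2, size=120)
X[y == 1, :5] += 1.0                              # 5 informative features
X[:, 5] = X[:, 0] + 0.01 * rng.normal(size=120)   # a redundant near-copy of feature 0

# Stage 1: Mann-Whitney U test, keep features with p < 0.05
keep = [j for j in range(X.shape[1])
        if mannwhitneyu(X[y == 0, j], X[y == 1, j]).pvalue < 0.05]

# Stage 2: for |Spearman rho| > 0.90 pairs, drop the less RF-important feature
imp = RandomForestClassifier(n_estimators=200, random_state=0).fit(
    X[:, keep], y).feature_importances_
rho, _ = spearmanr(X[:, keep])
drop = set()
for a in range(len(keep)):
    for b in range(a + 1, len(keep)):
        if abs(rho[a, b]) > 0.90:
            drop.add(a if imp[a] < imp[b] else b)
keep = [f for i, f in enumerate(keep) if i not in drop]

# Stage 3: LASSO, retain only features with nonzero coefficients
lasso = LassoCV(cv=5, random_state=0).fit(X[:, keep], y)
selected = [f for f, c in zip(keep, lasso.coef_) if c != 0]
print(sorted(selected))
```

Note how stage 2 eliminates one of the two near-duplicate features (0 or 5) before LASSO ever sees them, which is what keeps the final coefficient estimates stable.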
Model building and evaluation
We used four machine learning algorithms to construct radiomics models: logistic regression (LR), support vector machine (SVM), RF, and Naive Bayes (NB). The best algorithm was selected by analyzing fitting and generalization performance. We used learning curves to assess the trend of the training and cross-validation scores as the sample size increased. If both scores converge to a stable value, the model is unlikely to benefit from additional training data. Learning curves can also be used to compare multiple models: the higher the training and cross-validation scores, the better the fitting performance, and the smaller the gap between the two, the better the generalizability.
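This comparison maps directly onto scikit-learn's `learning_curve`; the sketch below (synthetic data, LR as an example of one of the four candidates) shows the two quantities discussed above, the score trend and the training/cross-validation gap:

```python
# Hedged sketch of the learning-curve analysis on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="roc_auc",
    shuffle=True, random_state=0,
)
gap = train_scores.mean(axis=1) - cv_scores.mean(axis=1)
print(sizes)           # training-set sizes evaluated
print(gap.round(3))    # a small, shrinking gap suggests good generalizability
```

Plotting mean train and cross-validation scores against `sizes` for each of the four algorithms reproduces the comparison described in the text.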
Next, we incorporated the clinical and conventional MRI characteristics that remained statistically significant after univariate analysis into the radiomics model (combined model) to explore whether they could further improve performance. These clinical and conventional MRI characteristics were also fed into a separate model (clinic-radiological model). Multiclass variables were one-hot encoded before model building. To make the models comparable, we used the same machine learning algorithm as in the best radiomics model. The outcomes of the three models were comprehensively compared to identify the model with the best diagnostic efficiency.
The area under the ROC curve (AUC) was used as the main indicator for model evaluation and comparison. The sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were also calculated. Overfitting occurs when a model fits the training data well but performs poorly on other datasets, indicating poor generalizability. To reduce overfitting, all models were constructed with tenfold cross-validation, and diagnostic performance was evaluated using these indicators averaged over the ten cross-validation iterations. Generalizability was assessed by analyzing the AUC of each model in the internal and external validation sets.
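The evaluation scheme can be sketched as follows on synthetic data; the 0.5 probability threshold for deriving sensitivity/specificity/PPV/NPV is an assumption, as the paper does not state how the operating point was chosen:

```python
# Hedged sketch: tenfold CV with per-fold metrics averaged, as described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=400, n_features=15, random_state=0)
metrics = {"auc": [], "sens": [], "spec": [], "ppv": [], "npv": [], "acc": []}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for tr, te in cv.split(X, y):
    prob = LogisticRegression(max_iter=1000).fit(X[tr], y[tr]).predict_proba(X[te])[:, 1]
    tn, fp, fn, tp = confusion_matrix(y[te], prob >= 0.5).ravel()
    metrics["auc"].append(roc_auc_score(y[te], prob))
    metrics["sens"].append(tp / (tp + fn))   # sensitivity (recall on positives)
    metrics["spec"].append(tn / (tn + fp))   # specificity
    metrics["ppv"].append(tp / (tp + fp))    # positive predictive value
    metrics["npv"].append(tn / (tn + fn))    # negative predictive value
    metrics["acc"].append((tp + tn) / (tp + tn + fp + fn))
print({k: round(float(np.mean(v)), 3) for k, v in metrics.items()})
```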
All statistical analyses and graphics were produced using SPSS (v. 25; IBM), R (v. 4.11), and Python (v. 3.8.5). Normally distributed continuous variables are summarized as means ± standard deviations, and non-normally distributed continuous variables as medians (interquartile ranges). Continuous variables were compared with the Mann–Whitney U test or the independent-samples t test, and categorical variables with the Chi-square test or Fisher's exact test. The DeLong test was used to compare AUCs. A two-tailed p value < 0.05 was considered statistically significant.
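The DeLong comparison of paired AUCs can be sketched with the fast midrank formulation; this is an illustrative implementation on synthetic scores, not the authors' code (they likely used an R or Python package such as `pROC`):

```python
# Hedged sketch of the paired DeLong test (fast midrank formulation).
import numpy as np
from scipy.stats import norm, rankdata

def delong_test(y_true, p1, p2):
    """Two-sided DeLong test comparing the AUCs of two paired classifiers."""
    y_true = np.asarray(y_true, dtype=bool)
    m, n = int(y_true.sum()), int((~y_true).sum())
    aucs, v10, v01 = [], [], []
    for p in (np.asarray(p1, float), np.asarray(p2, float)):
        pos, neg = p[y_true], p[~y_true]
        tx = rankdata(pos)                         # midranks within positives
        ty = rankdata(neg)                         # midranks within negatives
        tz = rankdata(np.concatenate([pos, neg]))  # midranks over all scores
        aucs.append((tz[:m].sum() - m * (m + 1) / 2) / (m * n))
        v10.append((tz[:m] - tx) / n)              # structural components, positives
        v01.append(1.0 - (tz[m:] - ty) / m)        # structural components, negatives
    s10, s01 = np.cov(v10), np.cov(v01)
    var = ((s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m
           + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n)
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    return aucs[0], aucs[1], 2 * norm.sf(abs(z))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
p1 = y + rng.normal(0, 0.8, 200)   # stronger synthetic classifier
p2 = y + rng.normal(0, 1.5, 200)   # weaker synthetic classifier
auc1, auc2, pval = delong_test(y, p1, p2)
print(round(auc1, 3), round(auc2, 3), round(pval, 4))
```

Unlike a bootstrap comparison, the DeLong test accounts for the correlation between the two models' predictions on the same patients, which is the situation here (all three models are evaluated on identical validation sets).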