Benign and malignant diagnosis of spinal tumors based on deep learning and weighted fusion framework on MRI – Insights into Imaging

May 10, 2022

Image data

The final pathological diagnosis reports of consecutive patients with spinal tumors visiting the cooperative hospital between January 2006 and December 2019 were retrospectively reviewed with approval from the Institutional Review Board (IRB). This study included sagittal MRI images collected from 585 patients with spinal tumors (259 women, 326 men; mean age 48 ± 18 years, range 4–82 years), including 270 benign and 315 malignant patients. All patients had definite pathological results confirmed by trocar biopsy or surgery and were divided into a training set (n = 445; 180 benign, 265 malignant) and a testing set (n = 140; 90 benign, 50 malignant), as shown in Table 1. The training set included metastases and primary spinal tumors, whereas the testing set only included primary spinal tumors. There were 2150 sequences obtained from 585 patients, including 1625 sequences for training and 525 sequences for testing, and the slice thickness ranged from 3 to 7 mm. Each patient underwent T1-weighted (T1WI) and T2-weighted (T2WI, FS-T2WI) sequences. Four radiologists and one spine surgeon annotated the tumor regions of these images with rectangles using LabelMe [7] and cross-checked one another's labeled regions to ensure reliability. There were 20,593 annotated images, of which 15,778 were for training and 4815 for testing. Each patient had an average of four sequences, and each sequence had an average of nine labeled images. The benign or malignant label of each annotated tumor region was determined from the patient’s pathological report.

Our dataset is a complex spinal tumor dataset with more than 20 histological subtypes, as shown in Fig. 1. It should be noted that our cooperative hospital is the largest spine tumor center in our country; it receives a large number of spine tumor referrals and performs a large number of spine tumor operations every year. Therefore, our focus included spinal tumors and some neurogenic tumors that extend to or affect the spine structure (such as schwannoma and neurofibroma) [8, 9], whereas intradural and intramedullary tumors were referred to the Department of Neurosurgery. The tumors were located in different vertebrae, including the cervical, thoracic, lumbar, and sacral vertebrae, as shown in Table 2. Diagnosing such a complex spinal tumor dataset is challenging.

Proposed framework

This study proposes a multi-model weighted fusion framework (WFF) based on sagittal MRI sequences, which combines a tumor detection model, a sequence classification model, and an age information statistic module to diagnose benign and malignant spinal tumors at the patient level, as shown in Fig. 2, where \(p_{b}\) and \(p_{m}\) represent the probabilities of benign and malignant tumors, respectively. First, we used Faster-RCNN [10] to detect the tumor region in each MRI image and provide a rough probability of being benign or malignant. Subsequently, a sequence classification model was applied to classify the detected tumor regions to obtain sequence-level results. Finally, a weighted fusion decision was made according to the results of the above two models and age information for the final diagnostic results. Four-fold cross-validation was applied to the training set to train and validate the WFF and to select the hyperparameters of the deep models and the fusion weights.
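As an illustration of the four-fold cross-validation mentioned above, the following is a minimal sketch of how the training patients could be partitioned. The text does not state how the folds were formed, so splitting at the patient level, shuffling with a fixed seed, and the function name `four_fold_splits` are all assumptions made for this example.

```python
import random


def four_fold_splits(patient_ids, k=4, seed=0):
    """Yield (train_ids, val_ids) pairs for k-fold cross-validation.

    Assumption: folds are formed per patient, so all sequences of one
    patient end up in the same fold. The source does not specify this.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    folds = [ids[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        val = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, val
```

Each fold serves once as the validation set while the remaining three are used for training; hyperparameters and fusion weights would then be compared across the four validation results.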

Detection model for tumor localization and rough classification

This study used a Faster-RCNN with tri-class as the tumor detection model. With the limited labeled tumor regions, the MultiScale-SelfCutMix method [11] was used for data augmentation, which randomly extracts the labeled tumor regions and scales the width and height with a factor from 0.5 to 1. Scaled tumor regions were randomly placed in the original image near the spinal region. The detection model was divided into a feature extraction network (FEN), feature pyramid network (FPN) [12], region proposal network (RPN), and region of interest (ROI) extraction module. The FEN extracted image features which may contain tumor information, using ResNeXt101 [13] as the backbone network, which is an upgraded version of ResNet101. We also added deformable convolution [14] to ResNeXt101 to adapt it to various shapes of the tumor regions. Five scales including 1/4, 1/8, 1/16, 1/32, and 1/64 of the original image were used to extract different receptive field feature information, as shown in Fig. 3, and the number of feature maps was 128, 256, 512, 1024, and 2048, respectively.
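The augmentation step described above (extract a labeled tumor region, rescale it by a factor between 0.5 and 1, and paste it near the spine) can be sketched as follows. This is not the MultiScale-SelfCutMix implementation from [11]; it is a simplified illustration. The `spine_box` argument, the use of a single scale factor for both width and height, and nearest-neighbour resizing are assumptions.

```python
import random

import numpy as np


def resize_nearest(patch, new_h, new_w):
    """Nearest-neighbour resize of a 2-D array to (new_h, new_w)."""
    h, w = patch.shape
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return patch[rows[:, None], cols]


def self_cutmix(image, tumor_box, spine_box, rng=random):
    """Paste a rescaled copy of the labeled tumor region inside spine_box.

    Boxes are (x1, y1, x2, y2) in integer pixels. spine_box is a
    hypothetical stand-in for "near the spinal region".
    Returns the augmented image and the box of the pasted region.
    """
    x1, y1, x2, y2 = tumor_box
    patch = image[y1:y2, x1:x2]
    factor = rng.uniform(0.5, 1.0)            # scale factor from the text
    new_w = max(1, int((x2 - x1) * factor))
    new_h = max(1, int((y2 - y1) * factor))
    patch = resize_nearest(patch, new_h, new_w)

    sx1, sy1, sx2, sy2 = spine_box
    px = rng.randint(sx1, max(sx1, sx2 - new_w))
    py = rng.randint(sy1, max(sy1, sy2 - new_h))
    # Keep the pasted patch inside the image bounds.
    px = min(px, image.shape[1] - new_w)
    py = min(py, image.shape[0] - new_h)

    out = image.copy()                        # leave the input untouched
    out[py:py + new_h, px:px + new_w] = patch
    return out, (px, py, px + new_w, py + new_h)
```

The returned box would be added to the annotations of the augmented image with the same benign or malignant label as the source region.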

The FPN was used to fuse the five different scale features. Subsequently, the RPN generated a certain number of candidate boxes that may contain tumors, and the ROI extraction module adjusted the size of the selected candidate boxes to identify the tumors as benign or malignant. Non-maximum suppression (NMS) [15] was used to determine the final location of the tumor and the probability of being benign or malignant. Figure 4 shows the results of the proposed detection model. The green boxes and labels indicate benign tumors and their probabilities, the red boxes and labels indicate malignant tumors and their probabilities, and the yellow boxes indicate the ground truth.
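For readers unfamiliar with NMS, the following is a minimal sketch of the standard greedy procedure: keep the highest-scoring box, discard boxes that overlap it too much, and repeat. The IoU threshold of 0.5 is an assumed value for illustration; the threshold used in this study is not stated.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes.

    iou_threshold=0.5 is an assumption, not a value from the paper.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

In a detector such as Faster-RCNN this step is usually applied per class, so that a benign and a malignant candidate at the same location are resolved by their scores.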

Sequence classification model for benign and malignant diagnosis

The tumor detection model locates and roughly identifies tumor regions of every image from the same patient, which may result in false positives. Continuous frames contain more contextual information, which is useful for accurate diagnosis. Images in each sequence correspond to a continuous tumor region; therefore, we proposed a sequence classification model based on ResNeXt101 to further classify benign or malignant tumors.

In the training stage, we selected the largest labeled tumor region in the sequence and cropped N continuous regions of this size at this location, so that the same region served as the tumor region for all images in the sequence; the result was then rescaled to \(112 \times 112 \times N\) pixels. Extraction was repeated if the sequence contained fewer than N labeled images. To expand the training data, there was a 50% probability of randomly extracting images with tumor regions and a 50% probability of extracting images according to the index of Digital Imaging and Communications in Medicine (DICOM). The different sample rates were used to maintain a balance between benign and malignant samples during training, which can prevent the model from overfitting a certain tumor category. In the testing stage, based on the detected tumor regions from the above tumor detection model, we selected the largest detected tumor region and obtained N continuous regions of this size and location as the tumor region of all images in the whole sequence. The size was rescaled to \(112 \times 112 \times N\) pixels. Multiple adjacent tumor regions of the sequence were used as the input, and the probability that the sequence was benign or malignant was the output of the sequence classification model.
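The cropping procedure described above can be sketched as follows. This is an illustrative simplification: it takes the first N slices in order and cycles through them when fewer than N are available, which is one possible reading of "extraction was repeated". The random and DICOM-index sampling used during training is not modeled, and the function names are hypothetical.

```python
import numpy as np


def largest_box(boxes):
    """Return the box (x1, y1, x2, y2) with the largest area."""
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


def resize_nearest(img, size):
    """Nearest-neighbour resize of a 2-D array to (size, size)."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols]


def build_sequence_input(slices, boxes, n, size=112):
    """Build a (size, size, n) input volume for the sequence classifier.

    slices: 2-D arrays in slice order; boxes: one integer box per slice.
    The largest box is applied at the same location to every slice.
    """
    x1, y1, x2, y2 = largest_box(boxes)
    crops = [resize_nearest(s[y1:y2, x1:x2], size) for s in slices]
    # Assumption: cycle through the available slices if there are fewer than n.
    idx = [i % len(crops) for i in range(n)]
    return np.stack([crops[i] for i in idx], axis=-1)
```

Using one fixed crop window for the whole sequence keeps the slices spatially aligned, so the classifier sees the same anatomical location across adjacent images.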

Age information for benign and malignant diagnosis

We determined the relationship between the probability of malignant or benign tumors and the age of each patient in our training set. Figure 5 shows that the probability of malignancy increased with age, rising to approximately 50% in patients over 40 years of age and to almost 100% in patients over 80 years of age. We used the statistical probability of benign and malignant tumors in different age groups as a reference for patient-level diagnoses.
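One simple way to obtain such age-group statistics is to compute the fraction of malignant cases per age bin in the training set and look up a new patient's bin at test time. The sketch below assumes 10-year bins and a fallback probability of 0.5 for empty bins; neither value comes from the text, and the function names are hypothetical.

```python
def malignancy_by_age(ages, labels, bin_width=10):
    """Empirical fraction of malignant cases per age bin.

    Returns a dict mapping the bin start (e.g. 40 for ages 40-49)
    to the fraction of patients in that bin labeled "malignant".
    bin_width=10 is an assumption.
    """
    counts = {}
    for age, label in zip(ages, labels):
        start = (age // bin_width) * bin_width
        total, malignant = counts.get(start, (0, 0))
        counts[start] = (total + 1, malignant + (label == "malignant"))
    return {start: m / t for start, (t, m) in sorted(counts.items())}


def age_prior(age, table, bin_width=10, default=0.5):
    """Look up benign/malignant probabilities for a patient's age.

    default is used for bins with no training patients (an assumption).
    """
    p_m = table.get((age // bin_width) * bin_width, default)
    return {"benign": 1.0 - p_m, "malignant": p_m}
```

The resulting pair of probabilities plays the role of the age term in the weighted fusion.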

Multi-model weighted fusion strategy

To further improve the diagnostic performance for benign and malignant tumors, we proposed a multi-model weighted fusion strategy, as shown in Eq. (1).

$$P_{i}^{j,p} = \lambda_{1} \times D_{i}^{j,p} + \lambda_{2} \times M_{i}^{p} + \lambda_{3} \times A_{p} \tag{1}$$

where \(P_{i}^{j,p}\) represents the final benign and malignant probabilities of the j-th image of the i-th sequence, \(D_{i}^{j,p}\) represents the probability from the tumor detection model for the j-th image of the i-th sequence of the patient, \(M_{i}^{p}\) represents the probability from the sequence classification model for the i-th sequence of the patient, and \(A_{p}\) represents the probability based on the patient’s age. \(\lambda_{1}, \lambda_{2}, \lambda_{3}\) are the weights of the three terms.

The benign and malignant tumor categories of all images in each sequence were obtained by using Eq. (1), and the category with the largest proportion was selected as the sequence category. Finally, the category with the largest proportion of all sequences was selected as the benign or malignant category for this patient.
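The fusion and voting steps above can be sketched as follows: Eq. (1) is applied per image, the majority class over images gives the sequence category, and the majority class over sequences gives the patient category. The data layout, the function names, and the tie-breaking rule (the first label encountered wins) are assumptions for illustration; the weights must be supplied by the caller, since the selected values are not given in this section.

```python
from collections import Counter


def fuse_image(d_probs, m_probs, a_probs, weights):
    """Eq. (1): weighted sum of detection, sequence, and age probabilities.

    Each *_probs argument is a dict such as {"benign": .., "malignant": ..}.
    """
    l1, l2, l3 = weights
    return {c: l1 * d_probs[c] + l2 * m_probs[c] + l3 * a_probs[c]
            for c in d_probs}


def majority(labels):
    """Most frequent label; ties go to the label seen first (assumption)."""
    return Counter(labels).most_common(1)[0][0]


def diagnose_patient(sequences, age_probs, weights):
    """Patient-level category from per-image and per-sequence outputs.

    sequences: list of dicts with keys
      "detection": list of per-image probability dicts (D term),
      "sequence":  one probability dict for the sequence (M term).
    """
    seq_labels = []
    for seq in sequences:
        image_labels = []
        for d in seq["detection"]:
            p = fuse_image(d, seq["sequence"], age_probs, weights)
            image_labels.append(max(p, key=p.get))   # class with highest P
        seq_labels.append(majority(image_labels))    # sequence category
    return majority(seq_labels)                      # patient category
```

Voting twice, first within a sequence and then across sequences, means that a few misclassified images or one misclassified sequence need not change the patient-level result.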

Metrics

All models were trained on an Intel E5-2640 CPU and an NVIDIA GTX1080Ti GPU. Samples of malignant tumors were considered positive. Area under the curve (AUC) [16], accuracy (ACC), sensitivity (SE), and specificity (SP) were used as evaluation metrics. ACC, SE, and SP are defined in Eqs. (2), (3), and (4), respectively, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives. It should be noted that our task was to diagnose tumors based on early images of patients, which is a deep learning classification task. AUC, ACC, SE, and SP are the common metrics for measuring classification performance; evaluation methods such as RECIST are not applicable to our task.

$$\text{ACC} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FN} + \text{TN} + \text{FP}} \tag{2}$$

$$\text{SE} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{3}$$

$$\text{SP} = \frac{\text{TN}}{\text{TN} + \text{FP}} \tag{4}$$
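As a worked illustration of Eqs. (2) to (4), the following sketch counts TP, FN, TN, and FP from patient-level labels, with malignant as the positive class, and then evaluates the three formulas. The function names are hypothetical, and AUC is omitted because it requires continuous scores rather than hard labels.

```python
def confusion_counts(y_true, y_pred, positive="malignant"):
    """Count (TP, FN, TN, FP) with `positive` as the positive class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def acc_se_sp(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity as in Eqs. (2), (3), and (4).

    Raises ZeroDivisionError if a class is absent from the data.
    """
    acc = (tp + tn) / (tp + fn + tn + fp)
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return acc, se, sp
```

For example, 3 true positives, 1 false negative, 5 true negatives, and 1 false positive give ACC = 8/10, SE = 3/4, and SP = 5/6.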

To show the diagnostic level of radiologists, spine surgeons, and our model at the same time, we invited three doctors to make a diagnosis based on the images and age information of patients in the test set, including one radiologist (D1: 18 years’ experience) and two spine surgeons (D2: 24 years’ experience, D3: 8 years’ experience).