Abstract
Objective
Osteoporosis (OP) is a major public health problem that causes significant mortality and morbidity. Therefore, early diagnosis is essential. We aimed to predict OP by combining computed tomography (CT)-based radiomic data of the clivus with machine learning (ML) algorithms.
Materials and Methods
In this retrospective study, 140 cases that underwent dual energy X-ray absorptiometry (DEXA) and craniofacial CT within one year of each other between 2015 and 2021 were examined at our institution. According to DEXA T-scores, cases were divided into three groups: 30 OP, 33 osteopenia, and 77 normal. Trabecular components of the clivus were segmented, and 1023 radiomic features were extracted using 3D Slicer. Radiomic outputs consisted of features from original, Laplacian of Gaussian, and wavelet transform filtered images. Voxels were resampled to a standard size of 1x1x1 mm³. The Orange Data Mining program was used for ML. ReliefF and fast correlation-based filter were used for feature reduction. K-nearest neighbors, decision tree, random forest, logistic regression, support vector machine (SVM), Naive Bayes, and neural network were used as classifiers. Area under the curve (AUC), sensitivity, specificity, receiver operating characteristic curve, and confusion matrix were used for performance evaluation.
Results
In the binary classification of cases as OP and non-OP, the neural network achieved the highest success in predicting OP (AUC: 0.87). In the binary classification of BMD as low BMD and normal BMD, SVM was the best in predicting low BMD cases (AUC: 0.82). In the ternary classification of BMD as OP, osteopenia, and normal, Naive Bayes achieved the highest performance in distinguishing OP (AUC: 0.9) and osteopenia (AUC: 0.69). The Hounsfield unit values of the clivus were significantly different between low BMD and normal BMD cases (p<0.001).
Conclusion
ML algorithms using CT-based radiomic features of the clivus can predict OP and provide BMD information.
Introduction
Osteoporosis (OP) is an increasingly serious public health problem as the elderly population grows worldwide. In developed countries, 30% of all postmenopausal women have OP, and 50% of these patients experience one or more osteoporotic fractures in their lifetime (1). Vertebral and femoral fractures are more common than other bone fractures and are a significant cause of morbidity and mortality. Therefore, early diagnosis and fracture risk prediction are important in the management of OP (2, 3).
Dual energy X-ray absorptiometry (DEXA) is the gold standard method for diagnosing OP. However, erroneous results may be obtained with this two-dimensional examination in cases with osteodegenerative bony changes, vertebral instrumentations, and aortic calcifications. In recent years, quantitative computed tomography (CT) has emerged as a new method for diagnosing OP, successfully calculating bone density and mass (4). However, since it is a relatively expensive technique, researchers have searched for alternative methods to predict OP, such as detecting morphological changes in bone structures through conventional imaging techniques and analyzing histogram features of bone structures through software, without the need for new hardware. Many studies in periodontology and implant dentistry have pursued these aims (5-9). Lespessailles et al. (10) reported that the combined evaluation of bone texture analysis and bone mineral density (BMD) is superior to the evaluation of BMD alone in the diagnosis of OP. Kawashima et al. (11) retrospectively extracted the histogram features of the sphenoid triangle, mandibular condyle, and clivus from cranial CT images and reported significant results in the diagnosis of OP.
Radiomics, a new image-processing approach, has been developed in recent years. It quantitatively extracts hundreds of features from medical images that the human eye cannot distinguish (12). Radiomics achieves successful results in the differential diagnosis of tumors, determining the prognosis, and evaluating the response to treatment (13-15). In recent years, the number of studies related to radiomics and artificial intelligence in OP has been steadily increasing (16-25). Machine learning (ML) is a subset of artificial intelligence. It is used in the medical field to analyze large and complex data sets and assist in medical decision-making.
He et al. (26) showed that magnetic resonance imaging (MRI) of the lumbar spine and radiomics models could be used in the diagnosis of OP. Rastegar et al. (27) obtained radiomics data from DEXA images and created ML models that can be used in the classification of bone mineral loss.
We aimed to investigate the usability of radiomics and ML algorithms in OP prediction. We chose the clivus because studies focusing on the clivus for OP prediction are very rare. The only study we encountered was published by Kawashima et al. (11). Unlike that histogram analysis-based study, we used ML algorithms together with radiomic outputs, which consist of a much larger number of high-level texture features.
Materials and Methods
Cases with DEXA and craniofacial region CT (brain, neck, maxillofacial, and paranasal sinus CT) imaging within a maximum interval of one year between 2015 and 2021 were reviewed retrospectively. Age and gender were not considered as exclusion criteria. CT images with motion artifacts, IV contrast, or slice thickness of more than 1 mm were excluded from the study. Finally, a study group with 140 cases was obtained.
DEXA scans were performed with Lunar Prodigy (model 8743, GE Lunar, Madison, WI, USA). The patient height and weight were recorded. Anterior-posterior lumbar vertebrae and femur BMD were routinely measured. Body regions with implants were excluded during imaging.
The T-scores from the DEXA scan were based on L1-4 and the femur. The lowest T-score was used to group cases. The cases were classified as “osteoporosis” if the T-score was <-2.5, “osteopenia” if it was between -2.5 and -1, and “normal” if it was >-1 (23). Two binary classifications were made, OP versus non-OP (osteopenia + normal) and low BMD (OP + osteopenia) versus normal BMD, and one ternary classification was made as OP, osteopenia, and normal.
CT scans were performed with a 64-slice multidetector CT (Aquilion 64, Toshiba, Otawara, Japan). The parameters used in imaging were pitch factor 0.6-0.9, rotation time 0.5-0.75 seconds, tube voltage 120 kV, tube current 150-250 mAs, and slice thickness 0.5-1 mm.
The 3D Slicer 4.11.2 (www.slicer.org) program was used for the segmentation process. After anonymization, CT images were obtained in DICOM format and imported into 3D Slicer. An experienced radiologist manually segmented trabecular bone components of the clivus. The petrooccipital fissure laterally and the hypoglossal canal inferiorly limited the segmentation borders. The dorsum sellae and cortical bone were excluded from the segmentation (Figure 1).
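The grouping rule described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and treating T-scores of exactly -2.5 and -1 as osteopenia is an assumption about how "between -2.5 and -1" is read.

```python
def bmd_group(t_scores):
    """Return the BMD group of one case from its lumbar and femoral T-scores."""
    lowest = min(t_scores)  # the lowest T-score drives the grouping
    if lowest < -2.5:
        return "osteoporosis"
    if lowest <= -1.0:  # assumption: both boundaries count as osteopenia
        return "osteopenia"
    return "normal"


def binary_labels(group):
    """Map a ternary group to the two binary classification tasks."""
    return {
        "op_vs_non_op": "OP" if group == "osteoporosis" else "non-OP",
        "low_vs_normal_bmd": "normal BMD" if group == "normal" else "low BMD",
    }


# Hypothetical case: lumbar T-score -0.4, femoral T-score -1.8
group = bmd_group([-0.4, -1.8])
print(group, binary_labels(group))
```

An osteopenic case therefore counts as non-OP in the first binary task and as low BMD in the second.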
Laplacian of Gaussian image filters with two sigma values (0.5 mm and 2.5 mm) and wavelet transform filters were used for image filtering before radiomic feature extraction to create a high-throughput dataset. Voxel size for resampling was defined as 1x1x1 mm³ for standardization.
A total of 1023 features were obtained, including 18 first-order features, 24 GLCM (Gray Level Co-Occurrence Matrix), 14 GLDM (Gray Level Dependence Matrix), 16 GLRLM (Gray Level Run Length Matrix), 16 GLSZM (Gray Level Size Zone Matrix), 5 NGTDM (Neighbouring Gray Tone Difference Matrix) based features, 93 features from Laplacian of Gaussian filtered images with sigma value of 0.5 mm, 93 features from Laplacian of Gaussian filtered images with sigma value of 2.5 mm, and 744 features from wavelet transformed images. Detailed mathematical descriptions of radiomic features are available in the pyRadiomics library (https://pyradiomics.readthedocs.io/en/latest/features.html).
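As a quick consistency check, the feature counts listed above can be added up in Python. The per-class counts come from the text. The split of the 744 wavelet features into eight sub-band images is an assumption inferred from 744 / 93 = 8 and is not stated in the text.

```python
# Features per feature class on a single image, as listed in the text
features_per_image = {
    "first_order": 18,
    "GLCM": 24,
    "GLDM": 14,
    "GLRLM": 16,
    "GLSZM": 16,
    "NGTDM": 5,
}
per_image = sum(features_per_image.values())

# Number of images each feature set is computed on
image_counts = {
    "original": 1,
    "LoG sigma 0.5 mm": 1,
    "LoG sigma 2.5 mm": 1,
    "wavelet": 8,  # assumption: eight sub-bands, inferred from 744 / 93
}
total = per_image * sum(image_counts.values())

print(per_image, total)
```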
Orange Data Mining Tool Version 3.27 (https://orange.biolab.si) was used for feature reduction and classification models. For feature selection, one scoring method was chosen among information gain, information gain ratio, Gini decrease, ANOVA, chi-square (χ²), ReliefF, and fast correlation-based filter (FCBF). After numerous tests, the combination of feature selection method and number of features that gave the highest area under the curve (AUC) with the best-performing ML algorithm was retained. Stratified 10-fold cross-validation was used for validation.
K-nearest neighbors, decision tree, random forest, logistic regression, support vector machine (SVM), Naive Bayes, and neural network were used as ML algorithms. AUC, classification accuracy (CA), sensitivity (recall), specificity, F1 score, precision, receiver operating characteristic (ROC) curve, and confusion matrix were used to evaluate ML model performances.
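To illustrate what stratification means for a cohort of this size, the sketch below assigns 30 OP, 33 osteopenia, and 77 normal cases to ten folds so that each fold keeps roughly the same class proportions. It is a plain-Python illustration under stated assumptions, not Orange's implementation: the function name, the round-robin assignment, and the random seed are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign case indices to k folds while keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # continue the round-robin across classes to balance fold sizes
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)
    return folds


labels = ["OP"] * 30 + ["osteopenia"] * 33 + ["normal"] * 77
folds = stratified_folds(labels)

for test_fold in folds:
    held_out = set(test_fold)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # a classifier would be fitted on train_idx and scored on test_fold here
```

Each case is held out exactly once, so every prediction used for the performance metrics comes from a model that did not see that case during fitting.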
The Local Clinical Research Ethics Committee of Balıkesir University approved this study on 03.11.2021 with the decision number 2021/249.
Statistical Analysis
Statistical analysis was performed in the IBM SPSS 22.0 (SPSS Inc., Chicago, IL, USA) program. The Kolmogorov-Smirnov test was used to determine whether the data were normally distributed. Independent variables were shown as mean and standard deviation. Tukey’s HSD post hoc test was used to compare BMD groups. Pearson and Spearman correlation tests evaluated the relationship between continuous independent variables. Dependent variables were evaluated with the chi-square test.
The Hounsfield unit (HU) values of the clivus were measured by drawing the largest region of interest (ROI) covering the trabecular bone on each of three consecutive axial CT slices, and their arithmetic mean was calculated for each case. Whether the mean HU values were discriminative in detecting the BMD group was evaluated with AUC, cut-off, sensitivity, and specificity parameters by performing ROC analysis. P<0.05 was considered significant in all statistical results.
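The ROC analysis of mean HU values can be sketched as follows. The HU values are hypothetical, and the cut-off is chosen with Youden's J, which is an assumption: the text does not state how the cut-off was selected in SPSS. Because lower HU values indicate low BMD, a case is called positive when its value is at or below the cut-off.

```python
def auc_low_value_positive(pos, neg):
    """AUC (Mann-Whitney form) when lower values indicate the positive class."""
    wins = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(pos, neg):
    """Return (cut-off, sensitivity, specificity) maximising Youden's J."""
    best = None
    for c in sorted(set(pos) | set(neg)):
        sens = sum(v <= c for v in pos) / len(pos)  # positives at or below c
        spec = sum(v > c for v in neg) / len(neg)   # negatives above c
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, c, sens, spec)
    return best[1:]


# Hypothetical per-case mean HU values (mean of three axial ROIs)
low_bmd_hu = [80, 95, 110, 120, 125, 160]
normal_hu = [130, 155, 170, 185, 200, 220]

auc = auc_low_value_positive(low_bmd_hu, normal_hu)
cutoff, sensitivity, specificity = youden_cutoff(low_bmd_hu, normal_hu)
print(round(auc, 2), cutoff, round(sensitivity, 2), round(specificity, 2))
```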
The flow diagram is summarized in Figure 2.
Results
In our study, a total of 140 cases consisting of 124 women and 16 men aged 33-91 years were included. Cases were divided into three groups consisting of 30 OP, 33 osteopenia, and 77 normal cases according to T-scores. No statistically significant relationship was found between gender, age, and OP due to the low number of cases and the inhomogeneous age distribution. However, when compared according to T-scores, the mean T-scores of men (0.11) were significantly higher than the mean T-scores of women (-1) [t(133)=-2.2, p=0.024]. Body mass index (BMI) values were significantly lower in the OP group compared to the normal group (p=0.002) (Table 1). No statistically significant difference was found when the osteopenia vs. normal group and OP vs. osteopenia group comparisons were made.
First, cases were divided into two groups: OP and non-OP (osteopenia + normal). ReliefF was chosen as the feature selection method, and ten of the 1023 features were selected. In the classification process, the best-performing classifier predicting OP was the neural network (AUC=0.87, CA=0.86) (Table 2). It correctly identified 102 of 110 non-OP cases, resulting in a very high specificity value (specificity 0.93). Some classifiers, such as SVM and logistic regression, showed higher specificity values. However, these classifiers have lower reliability due to their lower sensitivity and F1 scores. The ROC curves of the ML algorithms are given in Figure 3.
The other binary classification was performed between cases with low BMD (osteopenia + OP) and normal BMD. We aimed to predict the decrease in BMD with ML algorithms. Sixteen features were selected from the database with ReliefF. In the classification process, SVM showed the most successful performance (AUC: 0.82, CA: 0.79) (Table 3), correctly predicting 46 of 63 patients with abnormal BMD and 65 of 77 patients with normal BMD. Other performance metrics of SVM were calculated as sensitivity 0.73, specificity 0.84, F1 score 0.76, and precision 0.79. All performance metrics of SVM to predict low BMD were higher than the other algorithms. The ROC curves of the ML algorithms are given in Figure 4.
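The SVM metrics reported above follow directly from the confusion-matrix counts in the text (46 of 63 low BMD cases and 65 of 77 normal BMD cases correctly predicted). The short Python check below recomputes them. The function name is illustrative, and low BMD is taken as the positive class.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix metrics for a two-class problem."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


# Counts from the text: low BMD is the positive class
tp, fn = 46, 63 - 46  # low BMD cases predicted correctly / incorrectly
tn, fp = 65, 77 - 65  # normal BMD cases predicted correctly / incorrectly

metrics = binary_metrics(tp, fn, tn, fp)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the results match the reported sensitivity, specificity, precision, F1 score, and CA.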
As a final ML classification step, cases were divided into three groups: osteopenia, OP, and normal. The FCBF method was applied, and the seven best features were selected. In this ternary classification, the Naive Bayes algorithm was the best-performing classifier in distinguishing OP (AUC: 0.9, CA: 0.86) (Table 4), correctly predicting 22 of 30 cases with OP and 66 of 77 normal cases. Sensitivity was 0.73, and specificity was 0.89. Some classification methods, such as SVM, logistic regression, and random forest, reached higher specificity. However, these algorithms’ sensitivities and F1 scores lagged behind those of the Naive Bayes algorithm. The ROC curves of the algorithms for estimating OP in the ternary classification consisting of OP, osteopenia, and normal cases are given in Figure 5.
In the ternary classification, the performance of ML algorithms in detecting cases with osteopenia was low (Table 5). The highest performance was obtained with the Naive Bayes algorithm (AUC: 0.69, CA: 0.76), which correctly predicted 11 of 33 osteopenic patients. Fifteen were misclassified as normal, and seven as OP.
When the mean HU values of the clivus from three axial slices were calculated by the ROI method, a moderate positive correlation was found between HU values and T-scores (r²=0.45, p<0.001). The cases were divided into three groups: OP, osteopenia, and normal. The mean HU value was 103 (95% confidence interval: 74.9-131.1) in the OP group, 113.8 (95% confidence interval: 88.9-138.7) in the osteopenia group, and 192 (95% confidence interval: 168-215.9) in the normal group (Table 6). Significant differences were found in the values measured between the low BMD (OP + osteopenia) and the normal group (p<0.001). No significant difference was found between the mean HU values in the OP and osteopenia groups. ROC analysis was performed to determine the success of the classical HU measurement method in predicting the low BMD group (Figure 6). The AUC value was 0.75 (95% confidence interval: 0.67-0.83), the cut-off value was 137 HU, and the sensitivity and specificity values were 0.6 and 0.72, respectively.
Finally, the volumetric mean HU values obtained from the segmentation of the clivus were examined. The original first-order mean values among the radiomic features, which express the volumetric mean HU value, were used without extra processing. There was no significant correlation between volumetric mean HU values and OP, osteopenia, and normal groups.
Discussion
High AUC values, such as 0.9 and 0.87, were obtained in the OP estimation using radiomics and ML algorithms. Prediction performance for low BMD was lower than for OP but still at an acceptable level (AUC: 0.82), whereas osteopenia alone was distinguished less well in the ternary classification (AUC: 0.69). The combined use of radiomics and ML algorithms was superior to HU values measured using the traditional ROI method (AUC: 0.75 for low BMD) in detecting OP and low BMD.
Apart from being two-dimensional imaging and using ionizing radiation, the most significant disadvantage of DEXA is the possibility of superimposition of dense structures such as soft tissues, metallic instruments, osteodegenerative changes, and atherosclerotic calcifications, which may cause BMD to be miscalculated. It is mentioned in the literature that the use of CT imaging in such cases can help diagnose missed OP (28, 29). We chose the clivus for this study because it is less prone to degeneration and is included in the field of view of common CT scans such as brain CT.
Because of the limitations of DEXA, there are efforts in the literature to develop an alternative diagnostic tool. Many studies report a positive correlation between T-scores and HU values obtained from bone CT scans, such as lumbar and wrist CT scans (30-36). Alawi et al. (37) reported a positive correlation between DEXA T-scores and HU values of lumbar vertebrae from abdominal CT images. Their study measured mean attenuation values as 115 HU in osteoporotic cases, 120 HU in osteopenic cases, and 174 HU in normal cases. While the difference between the abnormal BMD and normal groups was statistically significant, there was no statistically significant difference between the OP and osteopenia groups (37). Similar mean attenuation values were measured in our study: 103 HU in the OP group, 113.8 HU in the osteopenia group, and 192 HU in the normal group. Decreases in mean HU values in the low BMD group were also statistically significant in our study. Our study also found no significant difference between the OP and osteopenia groups. According to the ROC analysis, the group with low BMD was correctly diagnosed with a cut-off value of 137 HU with 72% specificity and 68% sensitivity. Considering that half of the insufficiency fractures in the population occur in osteopenic women (38), identifying patients with low BMD may be more important than distinguishing osteopenia and OP.
In a study conducted with a small number of patients (29 normal, 29 OP), Kawashima et al. (11) extracted two-dimensional radiomic features from CT images of bilateral greater wings of sphenoid, bilateral mandibular condyles, and clivus using the ROI method. The various types of texture features extracted from craniofacial trabecular bones, such as histogram features, GLCM features, and GLRL features, were found to be associated with OP. It is also mentioned that the clivus, one of the three skull base structures examined in the study, stands out as being less affected by degenerative findings (11).
Rastegar et al. (27) extracted radiomic features from lumbar and femoral DEXA images with the ROI method and analyzed them with ML algorithms. Diagnostic performance was moderate, with AUC values ranging from 0.5 to 0.78 in distinguishing OP, osteopenia, and normal groups (27).
In their retrospective study, Lim et al. (39) showed that ML models using radiomic features obtained from abdominopelvic CT images can predict femoral OP. The proximal femur was automatically segmented, including the cortex. The number of radiomic features was limited to 41, consisting of semantic features, first-order texture features, GLCM, and wavelet transform features. They used the Gini decrease for feature reduction and the random forest algorithm for classification. The cases were divided into two: 70% were used for the training dataset and 30% for the validation dataset. The random forest algorithm successfully predicted OP with 95% specificity and 80% sensitivity in the validation group. In addition, their study used 5-fold cross-validation. The literature recommends using 5 or 10 folds, and we used 10-fold cross-validation in our study. Unlike their study, we did not divide the cases into training and validation datasets due to the limited number of cases.
Fang et al. (20) recently reported that 2D transfer learning and 3D deep learning techniques showed excellent performance in screening for OP in chest CT scans. Another recent article on opportunistic OP screening using chest CT scans found that the three-dimensional segmentation of the thoracic vertebral body and the subsequent radiomics outputs showed similar performance to ML models. The AUC values are similar to those in our article (AUC: 0.8-0.9) (21).
In another study regarding osteoporotic fracture estimation, using microstructural femoral MRI data and fracture risk assessment tool (FRAX) data together with ML algorithms was superior to using MRI data and FRAX data alone (40). A study conducted in India proposed that an automated diagnostic technique for low bone mass is possible using radiogrammetric measurements and texture features from radiography images together with a three-layer supervised artificial neural network (41).
Study Limitations
The main limitation of our study, apart from its retrospective nature, is the low number of patients. A larger patient group is needed for the use of training and external validation groups. In addition, the patient population was obtained from a specific region, and the findings may not generalize worldwide. In our study, BMD was classified according to DEXA T-scores. Therefore, due to the nature of DEXA, erroneous BMD and T-scores may have been obtained, which may have misled the statistical results. In future studies, it will be possible to compare the performances of radiomics scores and ML algorithms with DEXA by grouping cases as those with and without osteoporotic fractures. Using automatic segmentation can be beneficial in terms of standardization and saving time. Although we used a greater variety and number of algorithms than most studies, other ML algorithms remain untested. The maximum interval between DEXA and CT imaging was set at one year, and this period could be kept shorter. In addition, the systemic diseases and the drugs used were not considered.
Conclusion
Our study showed that OP and low BMD can be detected using CT-based radiomic features of the clivus and ML, although osteopenia alone was distinguished less reliably. We also found that clivus CT HU values correlated positively with DEXA T-scores.