Abstract
Objective
The aim of this study was to comprehensively evaluate the quality and readability of the content of artificial intelligence (AI)-generated texts about spondyloarthropathy (SpA).
Materials and Methods
The most frequently searched keywords related to the SpA-group were identified through Google Trends. The keywords were sequentially entered into AI chatbots (ChatGPT, Bard, Copilot). The Ensuring Quality Information for Patients (EQIP) tool was used to assess the clarity of information and quality of writing. Flesch-Kincaid readability tests (reading-ease and grade-level) and Gunning Fog index (GFI) were used to assess the readability of the texts.
Results
The mean EQIP score of the texts was 66.44. The mean Flesch-Kincaid reading ease score was 38.06. The mean Flesch-Kincaid grade level score was 11.38. The mean GFI score was 13.91. Our study concludes that the AI chatbots’ responses on SpA are generally of “good quality with minor problems”. The texts produced were complex enough to require approximately 11 years of education. When the quality and readability characteristics of the texts generated by the AI chatbots were compared, the EQIP scores of the texts generated by Copilot were higher than those generated by both ChatGPT and Bard (p<0.001, p=0.004, respectively). Furthermore, ChatGPT-generated texts were found to require a higher level of education than those generated by both Copilot and Bard (p=0.002, p=0.004, respectively).
Conclusion
This study reveals that AI chatbots’ texts about SpA have certain shortcomings in terms of quality and readability. As a result, it emphasizes that online resources and AI tools play an important role in information delivery in the healthcare field, but quality and readability control should be ensured. This can facilitate patients’ access to accurate, reliable, and comprehensible information.
Introduction
Spondyloarthropathy (SpA) is a term used to describe a group of diseases that share various hereditary and clinical characteristics. Common characteristics of SpA include axial skeleton involvement, peripheral arthritis, enthesitis, dactylitis, acute anterior uveitis, psoriasis, or inflammatory bowel disease. This group of diseases is classified as axial or peripheral based on the predominant clinical feature. The axial form is characterized by involvement of the spine and/or sacroiliac joints and includes subtypes such as ankylosing spondylitis and non-radiographic axial spondyloarthritis, whereas the peripheral form is characterized by peripheral arthritis, enthesitis, and/or dactylitis (1-3).
SpA typically begins in the third decade of life and is a significant group of diseases that can cause chronic pain and disability (4). Prevalence studies usually do not include imaging and HLA-B27 testing, making it difficult to determine the exact prevalence of SpA. However, studies in North America estimate the prevalence of SpA to be between 0.4% and 1.3% (5). Another study found that the global prevalence of SpA varies between 0.21% and 1.61% in different geographical regions (6).
Artificial intelligence (AI) is the evolution of algorithms designed to perform tasks associated with intelligent behavior. These algorithms encompass many areas such as natural language understanding, image recognition, decision-making, problem-solving, and learning from experience (7). In the healthcare sector, AI is utilized in various areas such as medical imaging, diagnosis, drug development, patient monitoring, and robot-assisted surgery (8).
Recent studies show that the use of AI-powered chatbots is on the rise (9). These robots are designed to generate appropriate and consistent responses to user inputs, addressing patients’ needs, resolving their questions, providing health information, and assisting with appointment scheduling (10, 11). However, there are uncertainties and reliability issues when obtaining health-related information online. Additionally, individuals with limited understanding of medical terms may struggle to assess the reliability and validity of the information they acquire (12). Therefore, it is crucial for patients to access information that is accessible, comprehensible, and reliable. Well-structured and trustworthy information can help patients learn about their diseases, understand treatment options, and implement preventive measures (13, 14).
There are numerous studies in the literature investigating the quality and readability of health information related to medical conditions. However, there is no study in the literature that evaluates the health information generated by AI chatbots for the SpA group. The aim of this study is to comprehensively evaluate the quality and readability of AI-generated texts related to SpA.
Materials and Methods
The study was conducted on May 10, 2024, at the Medical Faculty Hospital of our University. No human or animal participants were included in this study; hence, ethical approval was not required. Similar studies in the literature have followed the same approach. Since this study did not involve patient intervention, individual patient consent was not required (15).
The most frequently searched keywords related to SpA, ankylosing spondylitis, psoriatic arthritis, enteropathic arthritis, and reactive arthritis were identified using Google Trends. Before starting the searches, all browser data were completely cleared to ensure the results were not influenced. The search criteria were set to include data from 2004 to the present, covering the entire world and all categories. The most relevant keywords were selected from the related queries section of the results. The twenty-five most frequently used keywords were recorded for each search, except for enteropathic arthritis. Nine keywords were obtained for the enteropathic arthritis query. Exclusion criteria for the study included repetitive and irrelevant terms, which were removed from the analysis. In total, thirty keywords were identified (Table 1). The number of keywords to be evaluated was determined considering similar studies in the literature (12, 15, 16).
Three separate accounts dedicated to this study were created for the AI chatbots Bard Version 2.0.0 (https://bard.google.com/), Copilot (https://copilot.microsoft.com/), and ChatGPT (https://chat.openai.com/). The selected thirty keywords were entered sequentially into the chat interfaces of the AI chatbots. Each keyword was entered in a separate chat page to minimize the potential impact of previous queries and responses. The resulting responses were systematically documented for subsequent analysis, focusing particularly on quality, comprehensiveness, and readability. Texts were copied into Microsoft Office Word 2016 (Microsoft Corporation, Redmond, WA) and saved. Marks such as options and bullet points were removed during the evaluations. All answers were archived online. (Access addresses: https://archive.org/details/19_20240703_202407/gemini/1/, https://archive.org/details/5_20240703_202407/chatgpt/1/, https://archive.org/details/6_20240703/copilot/1/)
Evaluation of the Texts
The obtained 90 texts were evaluated for clarity and writing quality using the Ensuring Quality Information for Patients (EQIP) tool. A form containing 20 EQIP items was used to evaluate the texts (17). Each item was assessed with responses of “yes”, “partly”, “no”, or “not applicable” (N/A).
Since access permission was required for the health services contact number information and the responses were not produced in PDF format for the reader to take notes, these criteria were not evaluated (11). In addition, supporting the generated responses with visuals is another criterion that was not evaluated for Copilot and ChatGPT, which are text-based AI models.
The total score was calculated by assigning 1 point for “yes” responses, 0.5 points for “partly” responses, and 0 points for “no” responses. Items marked “not applicable” were excluded from the total number of items. The overall score was then divided by the number of valid items and expressed as a percentage. The EQIP score was categorized according to the score ranges recommended in the EQIP development publication: sources scoring between 76% and 100% were classified as “well-written and high-quality”, those scoring between 51% and 75% as “good quality with minor issues”, those scoring between 26% and 50% as having “serious quality issues”, and those scoring between 0% and 25% as having “severe quality issues” (18).
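The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' actual procedure: the function names and the "n/a" label are hypothetical, and because the published bands leave values such as 75.5% unassigned, the sketch assumes each band's lower bound acts as the cut-off.

```python
# Points per response as described in the text; "n/a" items are excluded.
POINTS = {"yes": 1.0, "partly": 0.5, "no": 0.0}


def eqip_score(responses):
    """Return the EQIP score as a percentage of the applicable items."""
    valid = [r.lower() for r in responses if r.lower() != "n/a"]
    if not valid:
        raise ValueError("at least one applicable item is required")
    total = sum(POINTS[r] for r in valid)
    return 100.0 * total / len(valid)


def eqip_category(score):
    """Map a percentage score to the EQIP quality band.

    Assumption: band lower bounds (76, 51, 26) are used as thresholds.
    """
    if score >= 76:
        return "well-written and high-quality"
    if score >= 51:
        return "good quality with minor issues"
    if score >= 26:
        return "serious quality issues"
    return "severe quality issues"
```

For instance, a text rated "yes", "partly", "no", and "n/a" on four items has three applicable items and 1.5 points, which gives a score of 50%.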
Each text was independently evaluated by two physical medicine and rehabilitation specialists (İ.C.Ö and E.Ö.) in separate settings to minimize bias. In case of any discrepancies, the assessment was carried out again and a solution was found by consensus among the experts.
To assess the readability of the texts, the Flesch-Kincaid readability tests (reading ease [FKRE] and grade level [FKGL]) and the Gunning Fog index (GFI) were utilized. Texts were evaluated using an online calculator (https://readabilityformulas.com/readability-scoring-system.php).
The FKRE score is calculated using the formula: 206.835 - (1.015 x average sentence length) - (84.6 x average syllables per word). The higher the score, the more readable the content. A score below 30 indicates a reading level comparable to that of university graduates.
The Flesch-Kincaid grade level (FKGL) score is calculated using the formula: 0.39 x (total words/total sentences) + 11.8 x (total syllables/total words) - 15.59. The result indicates the educational level of the audience the text is aimed at. For example, a result of 10 and above suggests the text is aimed at a high school level audience (19).
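As an illustration, the two Flesch formulas can be written as small Python functions that take raw counts as input. This is a minimal sketch: the function names are hypothetical, and counting words, sentences, and syllables is assumed to be done beforehand (the online calculator used in the study may count them differently).

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """FKRE: higher values indicate easier text."""
    avg_sentence_length = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """FKGL: approximate school grade needed to understand the text."""
    avg_sentence_length = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 0.39 * avg_sentence_length + 11.8 * avg_syllables_per_word - 15.59
```

For a hypothetical passage of 100 words, 5 sentences, and 150 syllables, the average sentence length is 20 and the average syllables per word is 1.5, so the functions return about 59.6 (FKRE) and 9.9 (FKGL).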
The GFI is an assessment based on sentence length and the complexity of words. GFI is calculated using the formula: 0.4 x [(number of words/number of sentences) + (number of words with three or more syllables x 100)/(number of words)]. According to the formula, shorter sentences indicate better readability. A score above 12 indicates a difficult text to read (19).
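A minimal sketch of the Gunning Fog calculation follows. It uses the standard form of the index, in which the factor 0.4 applies to the sum of the average sentence length and the percentage of complex words. The function name is hypothetical, and identifying words with three or more syllables is assumed to be done beforehand.

```python
def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning Fog index from raw counts.

    complex_words is the number of words with three or more syllables.
    """
    avg_sentence_length = total_words / total_sentences
    percent_complex = 100.0 * complex_words / total_words
    return 0.4 * (avg_sentence_length + percent_complex)
```

For a hypothetical passage of 100 words, 5 sentences, and 10 complex words, the index is 0.4 x (20 + 10) = 12, which sits exactly at the threshold above which a text is considered difficult.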
Readability scores were analysed and compared with the sixth grade readability level recommended by the American Medical Association and the National Institutes of Health. The accepted readability level for the FKRE formula was 80.0, whereas for the other two formulas it was 6 (20).
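The sketch below shows how a score can be checked against these targets. It is a plain threshold comparison, not the statistical test used in the study. The direction of each comparison is an assumption based on how the indices work: FKRE should be at or above its target, while FKGL and GFI should be at or below theirs. The names are hypothetical, and the example values are the overall mean scores reported in this study.

```python
# Sixth-grade targets as stated in the text.
TARGETS = {"FKRE": 80.0, "FKGL": 6.0, "GFI": 6.0}
# FKRE rises as text gets easier; the other two fall (assumed direction).
HIGHER_IS_EASIER = {"FKRE"}


def meets_target(metric, score):
    """Return True if the score is at or easier than the sixth-grade target."""
    target = TARGETS[metric]
    if metric in HIGHER_IS_EASIER:
        return score >= target
    return score <= target


# Overall mean scores reported in this study.
reported_means = {"FKRE": 38.06, "FKGL": 11.38, "GFI": 13.91}
results = {metric: meets_target(metric, score)
           for metric, score in reported_means.items()}
```

With these means, none of the three metrics meets its sixth-grade target.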
Statistical Analysis
Version 27.0 of the Statistical Package for the Social Sciences was used to analyze the study data. For normally distributed variables, descriptive statistics were shown as mean±standard deviation; for non-normally distributed variables, they were shown as median (minimum-maximum). The normality of the variable distributions was evaluated both visually (using probability plots and histograms) and analytically (using the Kolmogorov-Smirnov test).
The Kruskal-Wallis test was used to compare more than two groups when the data were non-normally distributed. The Mann-Whitney U test with Bonferroni correction was used for pairwise comparisons. Intraclass correlation coefficient (ICC) analysis was performed to determine the consistency in EQIP assessments. P-values of less than 0.05 were considered statistically significant.
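To illustrate the Bonferroni step, the sketch below enumerates the pairwise comparisons among the three chatbots and adjusts a set of raw p-values by multiplying each by the number of comparisons, capped at 1. The function name is hypothetical and the raw p-values are invented for illustration; they are not results from this study, and the statistical software used here may report the correction differently.

```python
from itertools import combinations


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


chatbots = ["ChatGPT", "Bard", "Copilot"]
pairs = list(combinations(chatbots, 2))  # three pairwise comparisons

# Hypothetical raw p-values, one per pair (illustrative only).
raw_p = [0.01, 0.02, 0.5]
adjusted = bonferroni_adjust(raw_p)
significant = [pair for pair, p in zip(pairs, adjusted) if p < 0.05]
```

With three comparisons, a raw p-value has to be below roughly 0.0167 to remain below 0.05 after adjustment, so only the first hypothetical pair stays significant in this example.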
Results
When examining the countries with the highest search frequencies related to SpA, the top three are New Zealand, Australia, and the United Kingdom (Figure 1). Similarly, for searches related to reactive arthritis and enteropathic arthritis, the leading countries are the United Kingdom, New Zealand, and Australia. For ankylosing spondylitis, the top three countries are Australia, New Zealand, and Ireland. In searches for psoriatic arthritis, Germany, Austria, and Switzerland rank the highest.
Table 2 presents the mean, standard deviation, median, minimum, and maximum values of the EQIP, FKRE, FKGL, and GFI scores. The EQIP scores of the texts range from 54.14 to 78.12, with an average of 66.44. The FKRE scores range from 0 to 60.60, with an average score of 38.06. The FKGL scores range from 7.5 to 24.5, with an average score of 11.38. The GFI scores range from 8.61 to 26.38, with an average score of 13.91.
Table 3 contains the median, minimum, and maximum values of the EQIP, FKRE, FKGL, and GFI scores for the texts generated by the AI chatbots. Significant statistical differences were found in the EQIP, FKRE, FKGL, and GFI scores of the texts created by the AI chatbots (p<0.001, p<0.001, p=0.001, p=0.003, respectively) (Table 3).
According to the results of the pairwise group comparisons, after Bonferroni correction, the EQIP scores of the texts generated by the Copilot chatbot were found to be significantly higher than those generated by both the ChatGPT and Bard chatbots (p<0.001 and p=0.004, respectively).
In terms of FKRE scores, the texts produced by the ChatGPT chatbot were found to be significantly lower than those produced by both the Copilot and Bard chatbots (p=0.005 and p<0.001, respectively). Similarly, for FKGL scores, the texts generated by the ChatGPT chatbot were significantly higher than those produced by both the Copilot and Bard chatbots (p=0.002 and p=0.004, respectively).
Additionally, the GFI scores of the texts generated by the Copilot chatbot were found to be significantly higher than those generated by both the ChatGPT and Bard chatbots (p=0.003 and p=0.007, respectively) (Table 3).
When the median readability scores of all AI chatbot (ChatGPT, Copilot and Gemini) responses were compared with the sixth grade reading level, a statistically significant difference was observed for all scores (p<0.001). According to all scores, the responses had a readability above the sixth grade level (Table 4). The ICCs for EQIP were 0.904 for ChatGPT, 0.896 for Copilot, and 0.873 for Gemini (p<0.001).
Discussion
Our study concludes that the responses of AI chatbots regarding SpA are generally of “good quality with minor issues”. The average FKRE score was 38, and the texts produced were complex enough to require approximately 11 years of education. This is the first study to evaluate the quality and readability of responses generated by AI chatbots for the most frequently searched keywords related to the SpA group.
When examining the countries with the highest search frequencies related to SpA, the top three are New Zealand, Australia, and the United Kingdom. Similarly, for searches related to reactive arthritis and enteropathic arthritis, the leading countries are the United Kingdom, New Zealand, and Australia. For ankylosing spondylitis, the top three countries are Australia, New Zealand, and Ireland. In searches for psoriatic arthritis, Germany, Austria, and Switzerland rank the highest. These findings indicate how the tendency to access information on different types of SpA varies across countries. The research highlights the importance of geographical differences in awareness and access to information regarding these specific medical conditions. These data suggest that global health education and information efforts should focus more on specific regions.
Our study concludes that the responses of the three different AI chatbots are generally of “good quality with minor issues”. The EQIP evaluations showed that all the texts reviewed followed a logical order, had a clear design, and addressed the reader respectfully and personally. However, some of the texts received zero points on certain evaluation criteria. We believe that even small improvements in these areas could elevate the texts from the “good quality” category to the “well-written and high-quality” category.
In intergroup comparisons, it was found that the EQIP scores of the texts generated by Copilot were significantly higher than those of the texts generated by ChatGPT and Bard. A determining factor for this difference could be that Copilot included references at the end of each text. It was observed that approximately half of the Bard texts included references, whereas ChatGPT did not include any references. Additionally, another factor contributing to the difference is that the majority of Bard’s responses were supported by visuals. In a study evaluating different AI chatbots about erectile dysfunction, it was similarly observed that the EQIP scores of texts produced by Copilot were higher than those produced by ChatGPT and Bard (12).
Accessible, accurate, and easily understandable information is crucial in supporting individuals coping with SpA. High-quality and straightforward texts help patients understand the complexity of their condition, the available treatment options, and preventive measures. However, complex and difficult-to-understand online health information can lead to misunderstandings and even health risks (21).
In a study by Fahy et al. (22) evaluating ChatGPT responses related to anterior cruciate ligament injury, readability problems were found. Similarly, in a study examining responses related to spinal cord injury, it was observed that ChatGPT caused difficulties in terms of readability (16). Similar to our results, other studies in the literature also found readability problems (15, 23). In intergroup comparisons, the texts generated by ChatGPT required a higher educational level compared to those produced by Copilot and Bard. The results of a different study evaluating AI chatbots on erectile dysfunction were similar to our findings (12). To address this problem, AI models should be trained with an emphasis on evaluating the quality of health-related texts using tools such as EQIP and readability indices such as FKRE, FKGL, and GFI. To this end, the underlying databases should be improved and audited. These improvements will be a step towards ensuring patient safety while increasing health literacy. When these conditions are met, patients can become more aware of the acceptance of the disease, the importance of treatment, and the control of the process.
We did not find a study evaluating the responses of AI chatbots for the SpA group in the literature; however, other research in this area has provided us with several important findings. For example, a study analyzing YouTube videos related to SpA in terms of quality and reliability found that there are useful videos as well as misleading videos, and that these videos often contain inaccurate clinical features and unproven alternative treatments (24). Another study on the quality and readability of online information about ankylosing spondylitis found that less than half of the websites had high-quality content and that the average readability levels of the websites were lower than recommended (25). These findings underscore the need for SpA patients and healthcare professionals to be cautious when accessing online information.
In today’s world, there is an increasing tendency for patients to seek information about health issues through online resources and AI-based chat tools (26). However, research indicates that these online resources are inadequate in terms of quality and readability (27-31). According to the results of our study, it is necessary to improve the quality and readability of AI chatbots as well. Consequently, patients and their families may suffer due to access to incorrect information (32). Therefore, ensuring the accuracy, quality, and readability of health information is of great importance. Compliance with quality and readability standards facilitates patients’ access to reliable information and enhances health literacy (33). However, each patient is unique, and the treatment process requires a personalized approach. Therefore, online resources and AI tools cannot replace healthcare professionals (34, 35). The importance of the physician-patient relationship should always be emphasized.
Study Limitations
Although the number of keywords evaluated in our study is approximately at the same level as in similar studies, it may limit the generalizability of the results. Additionally, only English keywords were evaluated in the study. Evaluating keywords in different languages can broaden the scope of the results. Another limitation of our study is the use of a single calculator to evaluate the readability of the texts. In the study conducted by Gül et al. (20), the correlation between different calculators was assessed, and moderate to strong correlations were obtained. Therefore, we also chose to use a single calculator.
Conclusion
This study reveals that AI chatbots’ texts about SpA have certain shortcomings in terms of quality and readability. In conclusion, it emphasizes that online resources and AI tools play an important role in information delivery in the healthcare field, but quality and readability control should be ensured. This can facilitate patients’ access to accurate, reliable and comprehensible information.