Performance of ChatGPT compared to clinical practice guidelines in making informed decisions for low back pain and sciatica: A cross-sectional study
Introduction
ChatGPT is a large language model developed by OpenAI, trained on large amounts of data to generate human-like text, and it has the potential to support informed decision-making. We aimed to assess the internal consistency, reliability, and accuracy of ChatGPT answers compared with recommendations from international clinical practice guidelines (CPGs) when answering complex clinical questions on low back pain and sciatica.
Methods
This cross-sectional study compared ChatGPT answers with CPG recommendations on the diagnosis and treatment of low back pain and sciatica. All eligible recommendations were classified as ‘should do’, ‘could do’, ‘do not do’, or ‘uncertain’ by consensus across CPGs. From the existing CPG recommendations, corresponding clinical questions were developed and posed to ChatGPT. We assessed (i) the internal consistency of ChatGPT’s text answers when each clinical question was posed three times, (ii) the reliability between two independent reviewers in grading ChatGPT answers into the same four categories (‘should do’, ‘could do’, ‘do not do’, or ‘uncertain’), and (iii) the accuracy of ChatGPT answers compared with CPG recommendations, i.e., whether answers were assigned to the correct category. Reliability was calculated using Fleiss’ kappa (κ) coefficients, whereas accuracy was measured by inter-observer agreement (IOA), defined as the frequency of agreements among all judgements.
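For readers less familiar with these two metrics, the minimal Python sketch below illustrates how Fleiss’ kappa and IOA (as defined above) can be computed; it is not the study’s analysis code, and all category codes and counts shown are hypothetical.

```python
# Minimal sketch (not the authors' code) of the two metrics described above.
# All data below are hypothetical placeholders.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters

# Each row is one clinical question; each column is one reviewer's grade,
# coded 0='should do', 1='could do', 2='do not do', 3='uncertain'.
ratings = np.array([
    [0, 0],   # both reviewers graded the ChatGPT answer 'should do'
    [2, 2],
    [1, 3],   # a disagreement between reviewers
    [0, 0],
])

# Fleiss' kappa: first convert rater-wise codes into per-item category counts.
counts, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(counts)

# IOA: frequency of agreements among all judgements, here the proportion of
# questions on which the grade of the ChatGPT answer matches the CPG category.
chatgpt_categories = np.array([0, 2, 1, 0])  # consensus grades of ChatGPT answers
cpg_categories = np.array([0, 1, 1, 3])      # CPG consensus categories (hypothetical)
ioa = np.mean(chatgpt_categories == cpg_categories)

print(f"Fleiss' kappa = {kappa:.2f}, IOA = {ioa:.0%}")
```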
Results
We found modest internal consistency of ChatGPT’s text answers across the three trials for all clinical questions (mean 49%, standard deviation 15%). Intra-rater reliability (reviewer 1: κ=0·90, standard error [se]=0·09; reviewer 2: κ=0·90, se=0·10) and inter-rater reliability between the two reviewers (κ=0·85, se=0·15) were “almost perfect”. Accuracy of ChatGPT answers against CPG recommendations was slight, with agreement in only 33% of recommendations.
Discussion and Conclusion
ChatGPT showed only modest internal consistency in its text answers, and its indications were inappropriate compared with the CPG recommendations for diagnosing and treating low back pain and sciatica. Clinicians and patients should use this AI model cautiously because, on average, it provides misleading indications.