Assessing the performance of AI chatbots in answering patients’ common questions about low back pain
  1. Simone P S Scaff1,
  2. Felipe J J Reis2,3,
  3. Giovanni E Ferreira4,
  4. Maria Fernanda Jacob1,
  5. Bruno T Saragiotto1,5
  1. Masters and Doctoral Programs in Physical Therapy, Universidade Cidade de Sao Paulo, Sao Paulo, Brazil
  2. Physical Therapy Department, Instituto Federal do Rio de Janeiro, Rio de Janeiro, Brazil
  3. Department of Physiotherapy, Human Physiology and Anatomy, Vrije Universiteit Brussel, Brussels, Belgium
  4. Institute for Musculoskeletal Health, The University of Sydney, Sydney, New South Wales, Australia
  5. Discipline of Physiotherapy, Graduate School of Health, Faculty of Health, University of Technology Sydney, Sydney, New South Wales, Australia

  Correspondence to Simone P S Scaff; simone.pivaro@uol.com.br

Abstract

Objectives The aim of this study was to assess the accuracy and readability of answers generated by large language model (LLM) chatbots to common patient questions about low back pain (LBP).

Methods This cross-sectional study analysed responses to 30 LBP-related questions covering self-management, risk factors and treatment. The questions were developed by experienced clinicians and researchers and piloted with a group of consumer representatives with lived experience of LBP. The questions were entered as prompts into ChatGPT 3.5, Bing, Bard (Gemini) and ChatGPT 4.0. Responses were evaluated for accuracy, readability and the presence of health advice disclaimers. Accuracy was assessed by comparing the generated recommendations against major clinical practice guidelines for LBP; responses were analysed by two independent reviewers and classified as accurate, inaccurate or unclear. Readability was measured with the Flesch Reading Ease Score (FRES).
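
The FRES is a fixed formula: FRES = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words), where higher scores indicate easier text and scores between 50 and 60 are typically rated difficult for general readers. As a minimal illustrative sketch (not the authors' scoring tool; the syllable counter below is a crude heuristic, whereas dedicated readability libraries use dictionaries or more refined rules), the calculation can be expressed in Python:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: one syllable per run of consecutive vowels,
    # minus a common silent trailing 'e'; always at least one.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    # FRES = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

# Example: a short, plain-language answer scores as relatively easy to read.
answer = "Stay active. Most episodes of low back pain improve within a few weeks."
print(round(flesch_reading_ease(answer), 1))
```

Scoring each chatbot response in this way and averaging across responses yields the kind of mean FRES reported in the Results.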

Results The 120 responses yielded 1069 recommendations, of which 55.8% were accurate, 42.1% inaccurate and 1.9% unclear. The treatment and self-management domains showed the highest accuracy, while risk factors had the most inaccuracies. Overall, LLM chatbots provided answers that were ‘reasonably difficult’ to read, with a mean (SD) FRES of 50.94 (3.06). Health advice disclaimers were present in approximately 70%–100% of responses.

Conclusions The use of LLM chatbots as tools for patient education and counselling in LBP shows promising but variable results. These chatbots generally provide moderately accurate recommendations, although accuracy varies with the topic of each question. The readability of the answers was inadequate, potentially limiting patients’ ability to comprehend the information.

  • Low Back Pain
  • Internet
  • Pain

Data availability statement

Data are available on reasonable request.


Footnotes

  • Handling editor Josef S Smolen

  • X @giovanni_ef

  • Collaborators not applicable.

  • Contributors SPSS contributed to the conception and design of the work; the acquisition, analysis and interpretation of data; critical revision of the work for important intellectual content; and final approval of the version to be published. FJJR contributed to the conception and design of the work, data analysis, critical revision for important intellectual content and final approval of the version to be published. GEF contributed to the methods, data analysis, critical revision for important intellectual content and final approval of the version to be published. MFJ contributed to the acquisition, analysis and interpretation of data, and final approval of the version to be published. BTS contributed to the conception and design of the work, data analysis, critical revision for important intellectual content, supervision and final approval of the version to be published. SPSS and BTS are the guarantors.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient and public involvement To ensure questions were relevant to people with LBP, we piloted them with a small group of consumer representatives (N=3) with lived experience of LBP.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any errors and/or omissions arising from translation and adaptation or otherwise.