ChatGPT-Generated Literature Review: Quod Erat Demonstrandum or
Ends Justifying the Means?
Dear Editor,
We would like to draw your attention to the increasing popularity of the
generative artificial intelligence (AI) chatbot ChatGPT (OpenAI, 2023)
and its relationship with the scientific literature. We have attempted
to use ChatGPT to replicate two literature reviews recently published in
Clinical Otolaryngology, comparing results, conclusions, and references.
We first assessed Lee et al. (2022), 'Posterior nasal neurectomy for
intractable rhinitis: A systematic review'. The conclusions ChatGPT
generated from the same research questions were comparable. However,
ChatGPT's references were confabulated, raising questions of provenance
and quality.
Cereceda-Monteoliva et al. (2021) reviewed sarcoidosis of the ear, nose,
and throat. Again, identical research questions generated near-identical
results, including numerical values for incidence, features, and
management. One generated reference was 'similar' in terms of the
author's name, but its title and journal were entirely incorrect. Of the
remaining four references provided by ChatGPT, only one was a
recognisable article. Further investigation showed that ChatGPT lacks
access to research databases, raising doubts about the reliability of
the conclusions it presents.
It is interesting that ChatGPT should generate correct conclusions from
incorrect working. We are reminded of school mathematics and quod erat
demonstrandum (Q.E.D.), where incorrect working affords you no marks
regardless of a correct answer.
ChatGPT is a Large Language Model (LLM) AI. Fundamentally, it mimics
human intelligence but does not replicate it: it analyses vast
quantities of text to predict the most likely next word in an answer,
as the confabulated references exemplify. A scientific literature review
follows a superficially similar process, analysing data and outputting a
most likely conclusion. Crucially, the latter involves higher-order
evaluation and critical thought based on myriad factors that currently
seem out of reach for ChatGPT in this use case. Readers familiar with
Bloom's Taxonomy of Cognition will recognise its relevance here [1].
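ChatGPT's true architecture is a transformer network trained on a corpus
vastly larger than anything reproducible here, but the 'predict the next
most likely word' principle can be given as a minimal sketch with a toy
bigram model (the corpus and function names below are purely
illustrative, not OpenAI's implementation):

```python
from collections import Counter, defaultdict

# Toy corpus: a stand-in for the vast text an LLM is trained on.
corpus = (
    "posterior nasal neurectomy is a treatment for intractable rhinitis . "
    "posterior nasal neurectomy is a surgical treatment ."
).split()

# 'Training': count which word follows which.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the statistically most likely next word. The model has no
    notion of whether the continuation is true, only of whether it is
    probable, which is how fluent but confabulated text can arise."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("nasal"))  # -> 'neurectomy'
```

Scaled up by many orders of magnitude, this is plausibly why output can
read fluently while citing sources that do not exist.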
A literature review often produces an already anticipated conclusion,
but it provides some of the highest-quality evidence on which to base
medical practice. Therefore, with ChatGPT, the ends do not justify the
means for the practice of medicine, even if the statistically most
likely wording of the conclusion is accurate.
However, the exponential growth of LLM AIs is extraordinary. Near-future
iterations of ChatGPT climbing to the top of Bloom’s Taxonomy are easily
imagined. Improved critical reasoning, combined with access to accurate
databases of peer-reviewed material, would substantiate an output even
if the conclusions were unchanged. An accurate 'show of working' could provide a
meaningful AI-generated literature review to responsibly guide medical
practice.
Q.E.D. - Quod Erat Demonstrandum
References
- Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl,
  D. R. (1956). Taxonomy of educational objectives: The classification
  of educational goals. Handbook I: Cognitive domain. New York: David
  McKay Company.
- Cereceda-Monteoliva, N., Rouhani, M. J., Maughan, E. F., Rotman, A.,
  Orban, N. T., Yaghchi, C. A., & Sandhu, G. S. (2021). Sarcoidosis of
  the ear, nose and throat: A review of the literature. Clinical
  Otolaryngology, 46(5), 935–940. https://doi.org/10.1111/coa.13814
- Lee, M. L., Chakravarty, P., & Ellul, D. (2022). Posterior nasal
  neurectomy for intractable rhinitis: A systematic review of the
  literature. Clinical Otolaryngology, 48(2), 95–107.
  https://doi.org/10.1111/coa.13991
- OpenAI. (2023). OpenAI. Retrieved from https://openai.com/