Medical diagnosis and decision: Human versus Artificial intelligence (AI)
DISEASE INTERVENTION COMPARISON RESULTS
Nature. 2024 Sep 25. doi: 10.1038/s41586-024-07930-y. Epub ahead of print Descriptive
IN medical informatics, artificial intelligence The Use of
recent more powerful large language models
As Methodology procedure
Is worse Than
older, less powerful large language models
To provide reliable answers: stability to different natural phrasings of the same question is improved, but more powerful language models do not avoid answering questions, even very difficult ones, and, paradoxically, do not secure areas of low difficulty
JAMA Netw Open. 2024 Oct 1;7(10):e2440969. doi: 10.1001/jamanetworkopen.2024.40969 Randomized Controlled Trial, Diagnostic
IN medical informatics, clinical decision support systems, artificial intelligence The Use of
artificial intelligence (AI), chatbots, large language models, ChatGPT-4, as diagnostic help for clinicians
As Diagnostic Tool
Is equal Than
conventional medical information sources, as diagnostic help for clinicians
To modify diagnostic performance on clinical vignettes. However, ChatGPT-4 alone was better than physicians at finding the right diagnosis (71% AI vs 63% physicians)
Eur Arch Otorhinolaryngol. 2024 Apr;281(4):2145-2151. doi: 10.1007/s00405-023-08423-w Controlled Trial (non-randomized)
IN medical informatics, clinical decision support systems, artificial intelligence The Use of
artificial intelligence (AI), chatbots, large language models, ChatGPT-3.5
As Methodology procedure
Is worse Than
human written medical structured text, UpToDate®
To obtain answers to clinical questions with accuracy (mean 0.25 on a scale of 0-2 with ChatGPT) and usefulness (mean 1.0 ChatGPT vs 2.63 UpToDate on a scale of 1-3). ChatGPT-3.5's training data was limited to 2021
NPJ Digit Med. 2025 Mar 22;8(1):175. doi: 10.1038/s41746-025-01543-z Systematic Review
IN medical informatics, clinical decision support systems, artificial intelligence, generative, diagnosis The Use of
generative artificial intelligence models
As Diagnostic Tool
Is equal Than
non-expert human physicians, but often worse than expert physicians
To modify diagnostic accuracy on vignette medical cases: 52% overall
JAMA. 2023 Dec 19;330(23):2275-2284. doi: 10.1001/jama.2023.22295 Randomized Controlled Trial
IN medical informatics, clinical decision support systems, artificial intelligence, machine learning, diagnosis The Use of
non-biased artificial intelligence model in support of clinical diagnosis of 3 conditions: pneumonia, heart failure, and chronic obstructive pulmonary disease
As Diagnostic Tool
Is better Than
clinical diagnosis alone and, even more, than biased artificial intelligence models
To improve diagnostic accuracy on clinical vignettes: 73% clinical alone vs 76% with AI predictions support vs 77.4% with AI support + explanations vs 62-64% with biased AI models support
JAMA Intern Med. 2016 Dec 1;176(12):1860-1861. doi: 10.1001/jamainternmed.2016.6001 Clinical Trial (non-controlled, non-randomized)
IN medical informatics, clinical decision support systems, diagnosis The Use of
computer symptom checkers, artificial intelligence
As Diagnostic Tool
Is worse Than
physicians, human (trained) intelligence
To find the correct diagnosis - on clinical vignettes - in the top 3 diagnoses listed (84% physicians vs 51% computer)
CMAJ. 2019 Dec 2;191(48):E1332-E1335. doi: 10.1503/cmaj.190506 Review (Narrative)
IN medical informatics, clinical decision support systems, diagnosis, medical thinking, diagnostic reasoning, cognition The Use of
artificial intelligence, machine learning
As Diagnostic Tool
Is worse Than
human intelligence
To accurately reach a general diagnostic decision. Currently only effective for highly targeted tasks
BMJ. 2021 Oct 20;375:n2281. doi: 10.1136/bmj.n2281 Systematic Review
IN medical informatics, clinical decision support systems, machine learning, diagnosis, prognostic The Use of
current machine learning based diagnostic / prediction models
As Diagnostic Tool
Is bad Than
no comparison here
To make accurate diagnoses / predictions: most studies on these models show poor methodological quality and are at high risk of bias
NEJM AI. 2025 Jul 15;2(8). doi: 10.1056/AIcs2401155 Descriptive, Cross-Sectional Study
IN medical informatics, keeping up-to-date medical knowledge systems, artificial intelligence The Use of
large language models (LLM), GPT-4o, Gemini 1.5 Pro, Llama 3.1, even fine-tuned
As Methodology procedure
Is bad Than
no comparison here
To integrate relevant information from new FDA drug approvals, patient records, and updated medical guidelines