Collectio articuli: Details of the article/s

In medical informatics, artificial intelligence

The Use of
recent, more powerful large language models
As Methodology procedure

Is worse Than
older, less powerful large language models

To: although stability to different natural phrasings of the same question is improved, more powerful language models do not avoid answering questions, even very difficult ones, and, paradoxically, do not secure areas of low difficulty

Nature. 2024 Sep 25. doi: 10.1038/s41586-024-07930-y. Epub ahead of print

[Citation]

Larger and more instructable language models become less reliable

Zhou L, Schellaert W, Martínez-Plumed F, Moros-Daval Y, Ferri C, Hernández-Orallo J

Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, ValGRAI, Valencia, Spain. jorallo@upv.es

Descriptive

The prevailing methods to make large language models more powerful and amenable have been based on continuous scaling up (that is, increasing their size, data volume and computational resources) and bespoke shaping up (including post-filtering, fine tuning or use of human feedback). However, larger and more instructable large language models may have become less reliable.

By studying the relationship between difficulty concordance, task avoidance and prompting stability of several language model families, here we show that easy instances for human participants are also easy for the models, but scaled-up, shaped-up models do not secure areas of low difficulty in which either the model does not err or human supervision can spot the errors.

We also find that early models often avoid user questions but scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, including errors on difficult questions that human supervisors frequently overlook.

Moreover, we observe that stability to different natural phrasings of the same question is improved by scaling-up and shaping-up interventions, but pockets of variability persist across difficulty levels.

These findings highlight the need for a fundamental shift in the design and development of general-purpose artificial intelligence, particularly in high-stakes areas for which a predictable distribution of errors is paramount.

Pubmed record: PMID: 39322679


Theme: Medical diagnosis and decision: Human versus Artificial intelligence (AI)