Researchers backed by a trio of leading US universities proposed a mechanism grounded in human psychology for evaluating the capabilities of large language models (LLMs), addressing what they said is a major benchmarking problem caused by the breadth of use cases.

In a study of whether LLMs perform as people expect, researchers funded by Harvard University, Massachusetts Institute of Technology (MIT) and the University of Chicago devised a method to evaluate how human generalisation shapes people's assessments of the technology.

MIT explained that when we interact with others, we “form beliefs” about what we think they “do and do not know”, a principle that carries over into our assessment of how well an LLM performs.

The researchers defined a human generalisation function based on “asking questions, observing how a person or LLM responds and then making inferences about how that person or model would respond to related questions”.

If an LLM shows it can handle a complex subject, people will expect it to be proficient in related, less-complicated areas.
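To illustrate the idea, the sketch below shows one way such a generalisation function could be modelled in Python: predicting success on a new question from outcomes observed on related ones. It is a minimal sketch of the concept only, not the researchers' method; the similarity measure and the example tasks are hypothetical.

```python
# Minimal sketch (not the researchers' code): a toy "human generalisation
# function" that predicts performance on a new task from observed outcomes
# on related tasks. The similarity() helper is a hypothetical stand-in.

def similarity(task_a: str, task_b: str) -> float:
    """Hypothetical relatedness score between two tasks, in [0, 1]."""
    a, b = set(task_a.lower().split()), set(task_b.lower().split())
    return len(a & b) / max(len(a | b), 1)

def generalise(observations: list[tuple[str, bool]], new_task: str) -> float:
    """Infer how likely a person or LLM is to answer new_task correctly,
    weighting each observed outcome by how related its task is."""
    weighted = [(similarity(task, new_task), correct) for task, correct in observations]
    total = sum(weight for weight, _ in weighted)
    if total == 0:
        return 0.5  # no related evidence: fall back to a neutral prior
    return sum(weight * correct for weight, correct in weighted) / total

# Example: a correct answer on a harder, related task raises the
# expectation of success on an easier one.
seen = [("solve quadratic equations", True), ("name the capital of France", False)]
print(generalise(seen, "solve linear equations"))  # prints 1.0
```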

Models that fall short of these expectations “could fail when deployed”, MIT stated.

Baseline
A survey asking participants whether they believed a person or an LLM would answer related questions correctly or incorrectly yielded “a dataset of nearly 19,000 examples of how humans generalise about LLM performance across 79 diverse tasks”.

The survey found participants were less able to generalise about how LLMs would perform than about how other people would, a finding the researchers believe could affect how models are deployed in future.

Alex Imas, professor of behavioural science and economics at the University of Chicago’s Booth School of Business, said the research highlighted a “critical issue with deploying LLMs for general consumer use”, because people may be put off using the models if they do not fully understand when responses will be accurate.

Imas added the study also provides a fundamental baseline for assessing LLM performance, specifically whether models “understand the problem they are solving” when giving correct answers, which in turn could help improve their performance in real-world scenarios.