This article is part of our coverage of the latest in AI research.
Major language models such as GPT-3 have advanced to the point where it has become difficult to measure the limits of their capabilities. If you have a very large neural network that can generate articles, write software code, and talk about feelings and life, you would expect it to reason about tasks and plans like a human does, right?
Wrong. A study by researchers at Arizona State University, Tempe, shows that LLMs perform very poorly when it comes to planning and methodical thinking, and suffer from many of the same flaws seen in other current deep learning systems.
Interestingly, the study finds that while very large LLMs such as GPT-3 and PaLM pass many of the tests intended to evaluate the reasoning abilities of artificial intelligence systems, they do so because these benchmarks are either too simplistic or too flawed and can be “cheated” through statistical tricks, something that deep learning systems are good at.
As LLMs break new ground every day, the authors propose a new benchmark to test the planning and reasoning capabilities of AI systems. The researchers hope their findings can help steer AI research towards the development of artificial intelligence systems that can deal with what is popularly known as “system 2 thinking” tasks.
The illusion of planning and reasoning
“Last year we evaluated GPT-3’s ability to extract plans from text descriptions — a task previously attempted with special-purpose methods — and found that off-the-shelf GPT-3 does quite well compared to those special-purpose methods,” Subbarao Kambhampati, an Arizona State University professor and co-author of the study, told TechTalks. “That naturally made us wonder what ‘emergent capabilities’ — if any — GPT-3 has for handling the simplest planning problems (e.g. making plans in toy domains). We immediately found that GPT-3 is quite spectacularly bad in anecdotal testing.”
Intrigued by the abundance of “#LLM’s are Zero shot …’s” papers, we wanted to see how good LLMs are at planning and reasoning about change.

tldr; off the shelf #GPT3 very bad at this..

— Subbarao Kambhampati (కంభంపాటి సుబ్బారావు) (@rao2z) June 22, 2022
Interestingly, however, GPT-3 and other major language models perform very well on benchmarks designed for logical reasoning and ethical reasoning, skills previously thought to be off-limits to deep learning systems. A previous study by Kambhampati’s group at Arizona State University demonstrated the effectiveness of large language models at generating plans from text descriptions. Other recent studies include one showing that LLMs can do zero-shot reasoning if provided with a special trigger phrase.
In these benchmarks and studies, however, “reasoning” is often used too broadly, Kambhampati believes. What LLMs actually do, he argues, is create the appearance of planning and reasoning through pattern recognition.
“Most benchmarks rely on superficial (one or two step) reasoning, as well as tasks for which there is sometimes no real ground truth (e.g., getting LLMs to reason about ethical dilemmas),” he said. “It’s possible that a pure pattern completion engine with no reasoning ability would still do well on some such benchmarks. After all, while System 2’s reasoning skills can sometimes be compiled down to System 1, it’s also the case that System 1’s ‘reasoning skills’ may just be reflexive responses drawn from patterns the system has seen in its training data, without anything akin to actual reasoning taking place.”
System 1 and System 2 thinking
System 1 and System 2 thinking were popularized by psychologist Daniel Kahneman in his book Thinking, Fast and Slow. System 1 is the fast, reflexive, and automated way of thinking and acting that we usually rely on, such as when walking, brushing our teeth, tying our shoelaces, or driving in a familiar area. Even much of speech is performed by System 1.
System 2, on the other hand, is the slower thinking mode we use for tasks that require methodical planning and analysis. We use System 2 to solve calculus equations, play chess, design software, plan a trip, solve a puzzle, etc.
But the boundary between System 1 and System 2 is not clear. Take driving, for example. When you learn to drive, you need to focus entirely on coordinating your muscles to operate the gears, steering wheel, and pedals, all while keeping an eye on the road and the side and rear-view mirrors. This is clearly System 2 at work. It takes a lot of energy, requires your full attention, and is slow. But as you gradually repeat the procedure, you learn to perform it without thinking. The job of driving shifts to your System 1, so you can perform it without straining your mind. One sign that a task has been integrated into System 1 is the ability to perform it unconsciously while concentrating on another task (for example, you can tie your shoes and talk at the same time, brush your teeth and read, drive and talk, etc.).
Even many of the very complicated tasks that remain within the domain of System 2 are eventually partially integrated into System 1. For example, professional chess players rely heavily on pattern recognition to speed up their decision-making. You can see similar examples in math and programming, where after doing things over and over, tasks that once required careful thought come to you automatically.
A similar phenomenon can occur in deep learning systems exposed to very large data sets. They may have learned to perform the simple pattern recognition phase of complex reasoning tasks.
“Plan generation requires concatenating reasoning steps to come up with a plan, and a solid ground truth about correctness can be established,” Kambhampati said.
A new benchmark for testing planning in LLMs
“Given the excitement around the hidden/emergent properties of LLMs, we thought it would be more constructive to develop a benchmark that provides a variety of planning/reasoning tasks, one that can serve as a yardstick as people improve LLMs through fine-tuning and other approaches to adapt their performance to reasoning tasks. This is what we ended up doing,” Kambhampati said.
The team developed their benchmark based on the domains used in the International Planning Competition (IPC). The framework consists of multiple tasks that evaluate different aspects of reasoning. For example, some tasks will evaluate the LLM’s ability to create valid plans to achieve a particular goal, while others will test whether the plan generated is optimal. Other tests include reasoning about the results of a plan, recognizing whether different text descriptions refer to the same goal, reusing parts of one plan in another, shuffling plans, and more.
To conduct the tests, the team used Blocks world, a problem framework that revolves around placing a series of different blocks in a certain order. Each problem has an initial condition, an end goal, and a set of allowed actions.
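To make the setup concrete, here is a minimal sketch of a Blocks world problem and a plan validator in Python. This is illustrative only, not the authors’ benchmark code; the state encoding, the single `move` action, and the block names are assumptions for this example (the actual IPC domains use the richer PDDL formalism with pick-up/put-down/stack/unstack actions).

```python
# Minimal Blocks world sketch (illustrative; not the benchmark's actual code).
# A state maps each block to what it sits on ("table" or another block);
# a block is "clear" if nothing sits on top of it.

def clear(state, block):
    """A block is clear if no other block rests on it."""
    return all(under != block for under in state.values())

def apply_action(state, action):
    """Apply ('move', block, dest) if legal, else raise ValueError."""
    op, block, dest = action
    assert op == "move"
    if not clear(state, block):
        raise ValueError(f"{block} is not clear")
    if dest != "table" and not clear(state, dest):
        raise ValueError(f"{dest} is not clear")
    new_state = dict(state)
    new_state[block] = dest
    return new_state

def plan_is_valid(initial, goal, plan):
    """Check that executing the plan from the initial state reaches the goal."""
    state = dict(initial)
    try:
        for action in plan:
            state = apply_action(state, action)
    except ValueError:
        return False  # plan attempted an illegal action
    return all(state[b] == on for b, on in goal.items())

# Initial: C sits on A; A and B are on the table.  Goal: A on B on C.
initial = {"A": "table", "B": "table", "C": "A"}
goal = {"A": "B", "B": "C"}
plan = [("move", "C", "table"), ("move", "B", "C"), ("move", "A", "B")]
print(plan_is_valid(initial, goal, plan))  # True
```

Because a validator like this gives an unambiguous ground truth for every generated plan, Blocks world avoids the “no real ground truth” problem Kambhampati notes with ethical-reasoning benchmarks.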
“The benchmark itself is extensible and is intended to have tests from different IPC domains,” Kambhampati said. “We used the Blocks world examples to illustrate the different tasks. Any of those tasks (e.g. plan generation, goal shuffling, etc.) can be performed in other IPC domains as well.”
The benchmark developed by Kambhampati and his colleagues uses few-shot learning, where the prompt given to the machine learning model contains a solved example plus the main problem to be solved.
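A one-shot prompt of this kind might be assembled as in the sketch below. The `[STATEMENT]`/`[PLAN]` delimiters and the toy problems are assumptions made for illustration, not the exact format used in the study:

```python
# Hedged sketch of one-shot prompt construction for plan generation.
# The delimiters and phrasing here are invented for illustration; the
# study's actual prompt format may differ.

SOLVED_EXAMPLE = """\
[STATEMENT]
As initial conditions, block B is on the table and block A is on block B.
My goal is to have block B on top of block A.
[PLAN]
unstack block A from block B
put down block A
pick up block B
stack block B on top of block A
[PLAN END]
"""

def build_prompt(problem_statement):
    """Prepend a solved example so the model can imitate its format."""
    return SOLVED_EXAMPLE + "\n[STATEMENT]\n" + problem_statement + "\n[PLAN]\n"

prompt = build_prompt(
    "As initial conditions, block C is on the table and block D is on block C. "
    "My goal is to have block C on top of block D."
)
# `prompt` would then be sent to the LLM via an API call, and the
# completion checked against a plan validator for correctness.
```

The key point is that the model sees one fully worked problem-plan pair, so any failure on the new problem reflects its planning ability rather than unfamiliarity with the output format.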
Unlike other benchmarks, the problem descriptions of this new benchmark are very long and detailed. Solving them requires concentration and methodical planning and cannot be tricked through pattern recognition. Even a human who wants to solve them has to think carefully about each problem, take notes, possibly make visualizations and plan the solution step by step.
“Reasoning is a system-2 task in general. The collective delusion of the community was to look at the kinds of reasoning benchmarks that could probably be handled via compilation to System 1 (e.g. ‘the answer to this ethical dilemma, by pattern completion, is this’) rather than the actual reasoning that is necessary for the task at hand,” Kambhampati said.
Large language models are bad at planning
The researchers tested their framework on Davinci, the largest version of GPT-3. Their experiments show that GPT-3 performs moderately on some types of planning tasks, but performs very poorly in areas such as plan reuse, plan generalization, optimal planning, and replanning.
“In fact, the first studies we’ve seen show that LLMs are particularly bad at anything that could be considered a planning task, including plan generation, optimal plan generation, plan reuse, or replanning,” Kambhampati said. “They do better at the planning-related tasks that don’t require chains of reasoning, such as goal shuffling.”
In the future, the researchers will add test cases based on other IPC domains and provide baseline performance from human subjects on the same benchmarks.
“We’re also curious to see if other variants of LLMs do better on these benchmarks,” Kambhampati said.
Kambhampati emphasizes that the aim of the project is to put the benchmark out there and give an idea of where the current baseline is. The researchers hope their work opens new windows for developing planning and reasoning capabilities in current AI systems. For example, one direction they propose is to evaluate the effectiveness of fine-tuning LLMs for reasoning and planning in specific domains. The team already has preliminary results on an instruction-following variant of GPT-3 that appears to do marginally better on the easy tasks, though it too remains around the 5 percent level on actual plan-generation tasks, Kambhampati said.
Kambhampati also believes that learning and acquiring world models would be an essential step for any AI system that can reason and plan. Other scientists, including deep learning pioneer Yann LeCun, have made similar suggestions.
“If we agree that reasoning is part of intelligence, and we want to argue that LLMs do it, we definitely need benchmarks for plan generation there,” Kambhampati said. “Instead of just taking a negative stance, we provide a benchmark so that people who believe that reasoning can be gotten out of LLMs, even without special mechanisms like world models and reasoning about dynamics, can use the benchmark to support their point of view.”
This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the bad side of technology, the darker implications of new technology, and what to watch out for. You can read the original article here.