Anil Ananthaswamy in Quanta: A theory developed by Sanjeev Arora of Princeton University and Anirudh Goyal, a research scientist at Google DeepMind, suggests that the largest of today’s LLMs are not stochastic parrots. The authors argue that as these models get bigger and are trained on more data, they improve on individual language-related abilities and also develop new ones by combining skills in a manner that hints at understanding — combinations that were unlikely to exist in the training data.
This theoretical approach, which provides a mathematically provable argument for how and why an LLM can develop so many abilities, has convinced experts, Geoffrey Hinton among them. And when Arora and his team tested some of its predictions, they found that these models behaved almost exactly as expected. By all accounts, they’ve made a strong case that the largest LLMs are not just parroting what they’ve seen before.
“[They] cannot be just mimicking what has been seen in the training data,” said Sébastien Bubeck, a mathematician and computer scientist at Microsoft Research who was not part of the work. “That’s the basic insight.”
More Data, More Power
The emergence of unexpected and diverse abilities in LLMs, it’s fair to say, came as a surprise. These abilities are not an obvious consequence of the way the systems are built and trained. An LLM is a massive artificial neural network, which connects individual artificial neurons. These connections are known as the model’s parameters, and their number denotes the LLM’s size. Training involves giving the LLM a sentence with the last word obscured, for example, “Fuel costs an arm and a ___.” The LLM predicts a probability distribution over its entire vocabulary, so if it knows, say, a thousand words, it predicts a thousand probabilities. It then picks the most likely word to complete the sentence — presumably, “leg.”
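As a rough illustration of that prediction step, here is a minimal Python sketch. The four-word vocabulary and the raw scores are invented for the example, and the softmax function is the standard way of turning scores into probabilities; nothing here is taken from a real model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up scores for the prompt
# "Fuel costs an arm and a ___." (assumptions for illustration only)
vocab = ["leg", "wing", "car", "the"]
scores = [4.0, 1.0, 0.5, -1.0]

probs = softmax(scores)                      # one probability per vocabulary word
best_word = vocab[probs.index(max(probs))]   # pick the most likely word

print(dict(zip(vocab, [round(p, 3) for p in probs])))
print(best_word)
```

A real LLM does the same thing over a vocabulary of many thousands of entries, with scores computed by the network rather than typed in by hand.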
Initially, the LLM might choose words poorly. The training algorithm then calculates a loss — the distance, in some high-dimensional mathematical space, between the LLM’s answer and the actual word in the original sentence — and uses this loss to tweak the parameters. Now, given the same sentence, the LLM will calculate a better probability distribution and its loss will be slightly lower. The algorithm does this for every sentence in the training data (possibly billions of sentences), until the LLM’s overall loss drops down to acceptable levels. A similar process is used to test the LLM on sentences that weren’t part of the training data.
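The loop described above can be sketched in a few lines. This toy version assumes the loss is cross-entropy (the usual choice for next-word prediction, though the article does not name it) and treats the model's "parameters" as nothing more than one adjustable score per word, which is a drastic simplification of a neural network. The vocabulary, learning rate and step count are all hypothetical.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss is small when the correct word gets a high probability."""
    return -math.log(probs[target_index])

vocab = ["leg", "wing", "car", "the"]   # hypothetical vocabulary
target = vocab.index("leg")             # the actual word in the sentence
params = [0.0, 0.0, 0.0, 0.0]           # untrained: every word equally likely
learning_rate = 0.5                     # assumed step size

losses = []
for step in range(5):
    probs = softmax(params)
    losses.append(cross_entropy(probs, target))
    # Gradient of cross-entropy with respect to each score:
    # (probability - 1) for the correct word, probability for the others.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Tweak each parameter a little in the direction that lowers the loss.
    params = [w - learning_rate * g for w, g in zip(params, grads)]

print([round(loss, 3) for loss in losses])
```

Each pass over the same sentence yields a slightly lower loss, which is the behavior the paragraph describes.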
A trained and tested LLM, when presented with a new text prompt, will generate the most likely next word, append it to the prompt, generate another next word, and continue in this manner, producing a seemingly coherent reply. Nothing in the training process suggests that bigger LLMs, built using more parameters and training data, should also improve at tasks that require reasoning to answer.
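The generate-append-repeat loop can be shown with a stand-in for the trained model. In this sketch the "model" is a hand-written lookup table (entirely hypothetical) that looks only at the last word, whereas a real LLM conditions on the whole prompt and often samples from the distribution instead of always taking the top word.

```python
# Hypothetical stand-in for a trained model: last word -> next-word probabilities.
toy_model = {
    "fuel": {"costs": 0.9, "is": 0.1},
    "costs": {"an": 0.8, "a": 0.2},
    "an": {"arm": 0.7, "apple": 0.3},
    "arm": {"and": 0.9, "<end>": 0.1},
    "and": {"a": 0.95, "the": 0.05},
    "a": {"leg": 0.85, "lot": 0.15},
    "leg": {"<end>": 1.0},
}

def generate(prompt_words, max_new_words=10):
    """Repeatedly pick the most likely next word and append it to the text."""
    words = list(prompt_words)
    for _ in range(max_new_words):
        dist = toy_model.get(words[-1])
        if dist is None:            # the toy table has no entry for this word
            break
        next_word = max(dist, key=dist.get)
        if next_word == "<end>":    # the model signals that the reply is complete
            break
        words.append(next_word)
    return words

print(" ".join(generate(["fuel"])))
```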
But they do. Big enough LLMs demonstrate abilities — from solving elementary math problems to answering questions about the goings-on in others’ minds — that smaller models don’t have, even though they are all trained in similar ways.
“Where did that [ability] emerge from?” Arora wondered. “And can that emerge from just next-word prediction?”
Connecting Skills to Text
Arora teamed up with Goyal to answer such questions analytically. “We were trying to come up with a theoretical framework to understand how emergence happens,” Arora said.
The duo turned to mathematical objects called random graphs. A graph is a collection of points (or nodes) connected by lines (or edges), and in a random graph the presence of an edge between any two nodes is dictated randomly — say, by a coin flip. The coin can be biased, so that it comes up heads with some probability p. If the coin comes up heads for a given pair of nodes, an edge forms between those two nodes; otherwise they remain unconnected. As the value of p changes, the graphs can show sudden transitions in their properties. For example, when p exceeds a certain threshold, isolated nodes — those that aren’t connected to any other node — abruptly disappear.
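The disappearance of isolated nodes is easy to see in a small simulation. The sketch below builds random graphs by flipping a biased coin for every pair of nodes and counts the nodes left without any edge. It relies on a standard result from random graph theory, that the threshold for isolated nodes sits near p = ln(n)/n; the graph size and random seed are arbitrary choices for the example.

```python
import math
import random

def count_isolated(n, p, rng):
    """Build a random graph on n nodes and count nodes with no edges."""
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:    # biased coin comes up heads: add an edge
                degree[i] += 1
                degree[j] += 1
    return sum(1 for d in degree if d == 0)

n = 300                             # arbitrary graph size for the demo
threshold = math.log(n) / n         # standard threshold for isolated nodes
rng = random.Random(0)              # fixed seed so runs are repeatable

for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
    p = factor * threshold
    isolated = count_isolated(n, p, rng)
    print(f"p = {factor:.2f} * ln(n)/n -> {isolated} isolated nodes")
```

Well below the threshold many nodes are isolated; well above it, isolated nodes become rare, and the change happens over a narrow range of p.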
Arora and Goyal realized that random graphs, which give rise to unexpected behaviors after they meet certain thresholds, could be a way to model the behavior of LLMs. Neural networks have become almost too complex to analyze, but mathematicians have been studying random graphs for a long time and have developed various tools to analyze them. Maybe random graph theory could give researchers a way to understand and predict the apparently unexpected behaviors of the largest LLMs.
The researchers decided to focus on “bipartite” graphs, which contain two types of nodes. In their model, one type of node represents pieces of text — not individual words but chunks that could be a paragraph to a few pages long. These nodes are arranged in a straight line. Below them, in another line, is the other set of nodes. These represent the skills needed to make sense of a given piece of text. Each skill could be almost anything. Perhaps one node represents an LLM’s ability to understand the word “because,” which incorporates some notion of causality; another could represent being able to divide two numbers; yet another might represent the ability to detect irony. “If you understand that the piece of text is ironical, a lot of things flip,” Arora said. “That’s relevant to predicting words.”
To be clear, LLMs are not trained or tested with skills in mind; they’re built only to improve next-word prediction. But Arora and Goyal wanted to understand LLMs from the perspective of the skills that might be required to comprehend a single text. A connection between a skill node and a text node, or between multiple skill nodes and a text node, means the LLM needs those skills to understand the text in that node. Also, multiple pieces of text might draw from the same skill or set of skills; for example, a set of skill nodes representing the ability to understand irony would connect to the numerous text nodes where irony occurs.
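One way to picture this bipartite structure is as a mapping from each text node to the set of skill nodes it connects to. The sketch below is only an illustration: the text labels, the skill names and the rule that a text counts as understood when the model has every linked skill are assumptions made for the example, not details of Arora and Goyal's formal model.

```python
# Hypothetical bipartite graph: each text node lists the skill nodes it links to.
text_to_skills = {
    "text_1": {"irony", "causality"},
    "text_2": {"division"},
    "text_3": {"irony"},
    "text_4": {"causality", "division"},
}

def texts_needing(skill):
    """All text nodes connected to a given skill node."""
    return sorted(t for t, skills in text_to_skills.items() if skill in skills)

def understood_texts(model_skills):
    """Texts whose required skills are all among the model's skills (assumed rule)."""
    return sorted(t for t, skills in text_to_skills.items() if skills <= model_skills)

print(texts_needing("irony"))
print(understood_texts({"irony", "causality"}))
```

Here a single skill such as irony links to several texts, and a text such as text_1 links to several skills, mirroring the two kinds of connection described above.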
The challenge now was to connect these bipartite graphs to actual LLMs and see if the graphs could reveal something about the emergence of powerful abilities. But the researchers could not rely on any information about the training or testing of actual LLMs — companies like OpenAI or DeepMind don’t make their training or test data public. Also, Arora and Goyal wanted to predict how LLMs will behave as they get even bigger, and there’s no such information available for forthcoming chatbots. There was, however, one crucial piece of information that the researchers could access.
Since 2021, researchers studying the performance of LLMs and other neural networks have seen a universal trait emerge. They noticed that as a model gets bigger, whether in size or in the amount of training data, its loss on test data (the difference between predicted and correct answers on new texts, after training) decreases in a very specific manner. These observations have been codified into equations called the neural scaling laws. So Arora and Goyal designed their theory to depend not on data from any individual LLM, chatbot or set of training and test data, but on the universal law these systems are all expected to obey: the loss predicted by scaling laws.
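The article does not spell out the equations, but scaling laws are commonly written as a power law: loss falls smoothly toward a floor as the model grows. The sketch below assumes that general shape. The function name and every constant in it are placeholders chosen for illustration, not fitted values from any published study.

```python
def predicted_loss(size, floor=1.5, scale=10.0, exponent=0.3):
    """Assumed power-law form: loss = floor + scale / size**exponent.

    All default constants are hypothetical placeholders, not measured values.
    """
    return floor + scale / size ** exponent

for size in (1e6, 1e7, 1e8, 1e9, 1e10):
    print(f"{size:.0e} parameters -> predicted loss {predicted_loss(size):.3f}")
```

The useful property for a theory is that such a curve depends only on scale, so a prediction about loss can be made without access to any particular model or its training data.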
Maybe, they reasoned, improved performance (as measured by the neural scaling laws) was related to improved skills. And these improved skills could be defined in their bipartite graphs by the connection of skill nodes to text nodes. Establishing this link between neural scaling laws and bipartite graphs was the key that would allow them to proceed.