Why Large Language Models Work

Many were surprised by the cognitive leap Large Language Models (LLMs) made over the past year or so. While it is indeed exciting, it is by no means surprising: the foundation of this advancement has been building up for decades. As of June 2023, the main concern about LLMs is that they work really well and we don’t know why. That lack of clarity leads to fear and apocalyptic narratives.

So why do LLMs work so well? Let me explain by analogy. In 2005, Ran El-Yaniv, Andrew McCallum, and I published an ICML paper on multi-way distributional clustering that provides insight into how modern-day LLMs work. In that paper, we proposed the most effective way (back then) to cluster documents, that is, to split a document collection into groups (clusters) that are coherent in their topic. We clustered documents hierarchically: we started with one cluster containing all the documents and split it into two, reshuffling documents between the two clusters so that each cluster would be as coherent as possible. The two clusters were then split into four, reshuffled, split into eight, reshuffled, and so on.
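
To make the top-down schedule concrete, here is a minimal Python sketch. The function and variable names are mine, and, for readability, cluster coherence is scored with a simple cosine-to-centroid measure over bag-of-words vectors; the objective we actually maximized (mutual information with word clusters) is described in the next paragraph.

```python
import numpy as np

def coherence(X, assign, k):
    """Average cosine similarity of each document to its cluster centroid
    (a simplified stand-in for the mutual-information objective)."""
    total = 0.0
    for c in range(k):
        members = X[assign == c]
        if len(members) == 0:
            continue
        centroid = members.mean(axis=0)
        sims = members @ centroid / (
            np.linalg.norm(members, axis=1) * np.linalg.norm(centroid) + 1e-12)
        total += sims.sum()
    return total / len(X)

def reshuffle(X, assign, k, passes=3):
    """Greedily move documents between clusters as long as coherence improves."""
    for _ in range(passes):
        moved = False
        for i in range(len(X)):
            current = assign[i]
            best_c, best_score = current, coherence(X, assign, k)
            for c in range(k):
                if c == current:
                    continue
                assign[i] = c
                score = coherence(X, assign, k)
                if score > best_score:
                    best_c, best_score = c, score
            assign[i] = best_c
            if best_c != current:
                moved = True
        if not moved:
            break
    return assign

def divisive_clustering(X, depth=3, seed=0):
    """Start with one cluster holding all documents; repeatedly double the number
    of clusters (1 -> 2 -> 4 -> 8 ...), reshuffling documents after every split."""
    rng = np.random.default_rng(seed)
    assign = np.zeros(len(X), dtype=int)
    k = 1
    for _ in range(depth):
        # split each cluster in two by randomly sending half its members to a new id
        assign = assign * 2 + rng.integers(0, 2, size=len(X))
        k *= 2
        assign = reshuffle(X, assign, k)
    return assign

# Toy usage: six bag-of-words vectors over a four-word vocabulary.
docs = np.array([[3, 1, 0, 0], [2, 2, 0, 0], [4, 0, 1, 0],
                 [0, 0, 3, 2], [0, 1, 2, 3], [0, 0, 4, 1]], dtype=float)
print(divisive_clustering(docs, depth=2))  # prints a cluster id per document
```

The reshuffling step here is deliberately greedy and brute-force; the point is only the overall shape of the procedure: split, reshuffle, split again.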

Cluster coherence was achieved by maximizing mutual information with clusters of words extracted from those documents. Word clusters were constructed simultaneously with document clusters, but in the opposite direction: we started with singleton clusters (one word per cluster), merged clusters in pairs, and reshuffled words between the merged clusters to maximize mutual information with the document clusters. We then merged pairs of clusters again, reshuffled words, merged again, reshuffled, and so on.
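
For reference, the quantity both reshuffling procedures push up is the mutual information between the document clusters and the word clusters, computed over the cluster-level joint distribution. The notation below is mine, not necessarily the paper’s:

```latex
I(\tilde{D}; \tilde{W})
  = \sum_{\tilde{d}} \sum_{\tilde{w}}
      p(\tilde{d}, \tilde{w}) \,
      \log \frac{p(\tilde{d}, \tilde{w})}{p(\tilde{d}) \, p(\tilde{w})},
\qquad
p(\tilde{d}, \tilde{w}) \propto \sum_{d \in \tilde{d}} \sum_{w \in \tilde{w}} n(d, w)
```

Here n(d, w) is the number of times word w occurs in document d, and a move of a document (or a word) between clusters is kept only if it increases the mutual information.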

Visually, the two-way distributional clustering process looked like the picture above (taken from our 2005 paper), where document clusters are at the top and word clusters are at the bottom. Remarkably, it looks like the layers of a deep neural network: clusters correspond to neurons, the bottom half represents encoding, and the top half represents decoding. The two main differences are:

  1. In an LLM, decoding is done after encoding. In our model, encoding and decoding were performed simultaneously, conditioned on each other. This was done mainly for efficiency but also for generalization.
  2. Each layer of a deep neural network contains numerical parameters that are hard to interpret, while each layer in our distributional clustering model contained groups of words (respectively, documents) that could be analyzed with relative ease. 

Conceptually though, both models achieve the same goal: at each layer of encoding, “topics” or “meanings” are created, becoming more and more abstract with depth. If the topics are of high quality, a decoding task can be performed intelligently. In our case, the task was document clustering; in the case of deep learning, there is a variety of more exciting tasks, such as language generation.

The analysis of topics, or clusters of words in our case, may explain what is going on inside an LLM. Our conclusions from the distributional clustering work were:

  1. If the underlying dataset of documents is small, the topics (that is, word clusters) are very noisy. For example, if the dataset contains only one document discussing politics, and this document happens to also talk about marijuana, the corresponding word cluster would contain words like “politician”, “Congress”, and “weed” together.
  2. As the data grows, the percentage of documents discussing both politics and marijuana decreases, so the co-occurrence of the words “Congress” and “weed” becomes rarer, while the co-occurrence of the words “Congress” and “politician” stays roughly the same (see the sketch after this list). Eventually, a set of topics related to politics will be created, together with a set of topics related to marijuana. Topics get cleaner as the data grows.
  3. Topics will never be perfectly pure. It is unreasonable to hope that all words related to politics will eventually converge into one topic, because topic boundaries are blurry, even for a human. Just as in distributional clustering, a (large) set of neurons in an LLM will be associated with the topic of politics. It is very much up to the training data to prescribe how many politics-related neurons the model will have, how pure they will be, how they will relate to each other, and how often they will fire together.
  4. Topics are not necessarily logical from the human point of view. Since there is no human in the process of constructing the topics, they don’t have to be comprehensible. Instead, they should contribute to an intelligent final result, after decoding is applied. Trying to force the topics to make sense to humans would significantly damage the quality of the final result.
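
To put rough numbers on point 2, here is a small Python sketch. The document counts are invented purely for illustration: pointwise mutual information (PMI) of the “Congress”/“politician” pair stays put as the corpus grows, while PMI of the “Congress”/“weed” pair collapses.

```python
import math

def pmi(n_both, n_a, n_b, n_docs):
    """Pointwise mutual information of a word pair, estimated from document
    frequencies: log p(a, b) / (p(a) * p(b))."""
    if n_both == 0:
        return float("-inf")
    return math.log((n_both / n_docs) / ((n_a / n_docs) * (n_b / n_docs)))

# Hypothetical document counts, made up for illustration only.
corpora = {
    "small corpus": dict(n_docs=100, congress=10, politician=9, weed=2,
                         congress_politician=8, congress_weed=1),
    "large corpus": dict(n_docs=1_000_000, congress=100_000, politician=90_000,
                         weed=20_000, congress_politician=80_000, congress_weed=150),
}

for name, c in corpora.items():
    pol = pmi(c["congress_politician"], c["congress"], c["politician"], c["n_docs"])
    wd = pmi(c["congress_weed"], c["congress"], c["weed"], c["n_docs"])
    print(f"{name}: PMI(Congress, politician) = {pol:.2f}, "
          f"PMI(Congress, weed) = {wd:.2f}")
```

With these made-up counts, the “Congress”/“politician” PMI is about 2.2 in both corpora, while the “Congress”/“weed” PMI drops from about 1.6 to about -2.6: the larger corpus pulls the two topics apart.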

Based on this experience, I’d take the liberty of looking into the future of LLMs. As they grow, they are not necessarily going to get smarter and smarter; rather, they will become more and more stubborn. If an overwhelming amount of data connects the words “politician” and “Congress” together, while a negligible amount of data connects “Congress” to “weed”, the model will become too rigid and refuse to ever connect the two topics. For a very large model, the classic “My wife is always right” conversation would look like this:

  • How much is 2 + 2?
  • 2 + 2 equals 4
  • But my wife says 2 + 2 equals 3
  • Nevertheless, 2 + 2 equals 4
  • But my wife is always right!
  • Still, 2 + 2 equals 4

In a very large model, topics related to arithmetic will be too “distant” from topics related to psychology, and the corresponding neurons will be trained to fire together so rarely that the model will almost never make a connection between arithmetic and psychology. On the one hand, it will be hard to tamper with its facts. On the other hand, the model will lose the creativity that is so badly needed in domains like drug discovery. The generalization ability of current LLMs largely comes from the scarcity of data. How to keep the model childish and adventurous as the training data grows will be yet another challenge to overcome!
