Why Large Language Models Work
Many were surprised by the cognitive leap Large Language Models (LLMs) made over the past year or so. While the progress is indeed exciting, it is by no means surprising: the foundation of this advancement has been building for decades. As of June 2023, the main concern about LLMs is that they work really well and we don’t know why. This lack of clarity breeds fear and apocalyptic narratives.
So why do LLMs work so well? Let me explain by analogy. In 2005, Ran El-Yaniv, Andrew McCallum, and I published an ICML paper on multi-way distributional clustering that offers insight into how modern LLMs work. In that paper, we proposed the most effective way (at the time) to cluster documents, that is, to split a document collection into groups (clusters) that are coherent in their topic. We clustered documents hierarchically: we started with one cluster containing all documents and split it into two, reshuffling documents between the two clusters so that each would be as coherent as possible. The two clusters were then split into four, reshuffled, split into eight, reshuffled, and so on.
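The divisive half of that procedure can be sketched in a few lines of Python. This is a simplified stand-in, not the paper's algorithm: it measures coherence by Euclidean closeness to a centroid rather than by the information-theoretic objective described below, and all names are illustrative.

```python
import numpy as np

def split_and_reshuffle(vectors, n_rounds=10):
    """One divisive step, sketched: split a cluster into two children
    and reshuffle members toward the nearer centroid until stable.
    (Simplified stand-in: coherence here is just Euclidean closeness,
    not the paper's mutual-information objective.)"""
    # Deterministic alternating initialization into two child clusters.
    labels = np.arange(len(vectors)) % 2
    for _ in range(n_rounds):
        centroids = np.stack(
            [vectors[labels == k].mean(axis=0) for k in (0, 1)]
        )
        # Reshuffle: move each document to the closer child centroid.
        dists = np.linalg.norm(
            vectors[:, None, :] - centroids[None, :, :], axis=2
        )
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignment is stable
        if 0 in np.bincount(new_labels, minlength=2):
            break  # guard: keep both child clusters non-empty
        labels = new_labels
    return labels

# Two well-separated pairs of "documents" end up in separate clusters.
vectors = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = split_and_reshuffle(vectors)
```

Applying the same step recursively to each child cluster yields the 1 → 2 → 4 → 8 hierarchy described above.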
Cluster coherence was achieved by maximizing mutual information with clusters of words extracted from those documents. Word clusters were constructed simultaneously with document clusters, but in the opposite direction: we started with singleton clusters (one word per cluster), merged pairs of clusters, and reshuffled words between clusters to maximize mutual information with the document clusters. We then merged pairs of clusters again, reshuffled words, merged again, and so on.
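The quantity being maximized, the mutual information between document clusters and word clusters, can be computed from a cluster-level co-occurrence table. A minimal sketch (my own illustration, not code from the paper):

```python
import numpy as np

def mutual_information(counts):
    """Mutual information (in bits) between document clusters (rows)
    and word clusters (columns), from a co-occurrence count table."""
    p = counts / counts.sum()                  # joint p(doc cluster, word cluster)
    pd = p.sum(axis=1, keepdims=True)          # marginal over document clusters
    pw = p.sum(axis=0, keepdims=True)          # marginal over word clusters
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p * np.log2(p / (pd * pw))
    # nansum treats the 0 * log 0 cells (NaN) as contributing zero.
    return np.nansum(terms)

# Perfectly aligned clusterings: one full bit of mutual information.
aligned = mutual_information(np.array([[10.0, 0.0], [0.0, 10.0]]))
# Independent clusterings: zero mutual information.
independent = mutual_information(np.array([[5.0, 5.0], [5.0, 5.0]]))
```

Each split/merge/reshuffle step is accepted only if it increases this score, which is what keeps the two clusterings informative about each other.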
Visually, the two-way distributional clustering process looked like the picture above, with document clusters on the top and word clusters on the bottom. The picture is taken from our 2005 paper. Remarkably, it resembles the layers of a deep neural network: clusters correspond to neurons, the bottom half performs encoding, and the top half performs decoding. The two main differences are:
Conceptually, though, both models achieve the same goal: at each layer of encoding, “topics” or “meanings” are created, becoming more and more abstract with depth. If the topics are of high quality, a decoding task can be performed intelligently. In our case the task was document clustering; in deep learning, there is a variety of more exciting tasks, such as language generation.
The analysis of topics, or clusters of words in our case, may explain what goes on inside an LLM. Our conclusions from the distributional clustering work were:
Based on this experience, I’d take the liberty of looking into the future of LLMs. As they grow, they are not necessarily going to get smarter and smarter; rather, they will become more and more stubborn. If an overwhelming amount of data connects the words “politician” and “Congress”, while a negligible amount of data connects “Congress” to “weed”, then the model will become too rigid and refuse to ever connect the two topics. For a very large model, the classic “My wife is always right” conversation would look like:
In a very large model, topics related to arithmetic will be too “distant” from topics related to psychology, and the corresponding neurons will be trained to fire together so rarely that the model will almost never make a connection between arithmetic and psychology. On the one hand, it will be hard to tamper with its facts. On the other hand, the model will lose the creativity so badly needed in domains like drug discovery. The generalization ability of current LLMs largely comes from the scarcity of data. How to keep the model childish and adventurous as the training data grows will be yet another challenge to overcome!
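The rigidity argument can be made concrete with pointwise mutual information between word pairs. The counts below are invented purely for illustration:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (bits) between two words,
    estimated from raw corpus counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Invented counts: "politician" and "Congress" co-occur heavily,
# while "Congress" and "weed" almost never do.
total = 1_000_000
strong = pmi(count_xy=5_000, count_x=20_000, count_y=30_000, total=total)
weak = pmi(count_xy=2, count_x=30_000, count_y=10_000, total=total)
# strong is positive (association); weak is negative (anti-association).
```

As the corpus grows, estimates like these harden: a pair with strongly negative PMI is one the model has, in effect, learned never to connect, which is the stubbornness described above.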
I say “happy”; you say “birthday”. This still doesn’t elucidate why LLMs work as well as they do.