🚀#ByteSizedAI🚀
Chapter - 2: Models & Tokens
In chapter 1, we explored ChatGPT, Large Language Models (LLMs), and why they're called Transformers. Today, let's dig deeper. We'll start with the model itself, the foundation of everything.
What is a Model?
At its core, a model is a function (think of it as a vending machine) that takes a "prompt" (your query in the chat interface, expressed as a series of tokens) and gives you an output by predicting the next token based on its learned knowledge. Models typically have billions of "parameters" (think of them as adjustable settings, like the volume and channel buttons on your TV remote) which are fine-tuned during training to capture patterns like word relationships and sentence structures. These parameters guide the prediction process in a model (we'll explore parameters in much more detail later).
So, in simple terms, a model takes an input and predicts the next token by calculating probabilities, selecting the one with the highest likelihood. This happens repeatedly to generate a complete response. This is why LLMs are known as "probabilistic" in nature and not "deterministic" (there is a way to change that too, via something known as "Temperature", but that's for another day).
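To make that concrete, here's a tiny Python sketch of that loop. Everything in it (the two-word contexts, the probability table, the pick_next helper) is invented for illustration; a real LLM computes these probabilities with billions of parameters rather than a lookup table.

```python
# Toy next-token prediction loop (illustrative only).
# The probability table is made up; a real model learns these values.
next_token_probs = {
    ("The", "tallest"): {"building": 0.72, "mountain": 0.20, "tree": 0.08},
    ("tallest", "building"): {"in": 0.85, "is": 0.10, "ever": 0.05},
}

def pick_next(context):
    probs = next_token_probs[context]
    # Greedy choice: always take the highest-probability token.
    return max(probs, key=probs.get)

tokens = ["The", "tallest"]
for _ in range(2):  # generate two more tokens, one at a time
    tokens.append(pick_next((tokens[-2], tokens[-1])))

print(" ".join(tokens))  # -> The tallest building in
```

Swap the greedy max for a random draw weighted by those probabilities and the output starts to vary from run to run; that's the "probabilistic" behaviour, and Temperature controls how adventurous that draw is.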
Hold on, probability is Math and ChatGPT generates text 🤔
Yes, it's all math, specifically high-dimensional vectors (ignore those for now). In machine learning, everything starts with tokens, which are numeric representations (known as "Token IDs") of words (or parts of words).
Ok, so What’s a "Token"?
A token is either a word or part of one, and it's case-sensitive. Think of it as the currency of operation for an LLM (going back to the vending machine analogy, tokens are like the coins you put into the machine). Your input query is a series of tokens, and the response from the LLM is again a series of tokens. For instance, in GPT-4 (the current model behind ChatGPT), "Hello" and "HELLO" are different tokens: "Hello" is one token (ID: 9906), but "HELLO" breaks into two (IDs: 51812 for "HEL" and 1623 for "LO"). The process of converting text into tokens is called Tokenization. The number of distinct tokens a model can recognize is finite. I like to call it the "Universe of Tokens"! 🙂
The Universe of Tokens
Each model has a finite set of tokens; think of it as the model's vocabulary. For example, GPT-4 has 100,277 possible tokens. So when you ask ChatGPT, "What is the tallest building in the world?", it breaks the phrase into tokens (drawn from its universe of 100,277 tokens), assigning each token a unique ID. This sequence of IDs is what gets sent for processing.
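You can verify these numbers yourself with OpenAI's open-source tiktoken library, which exposes the cl100k_base encoding that GPT-4 uses. A minimal sketch, assuming you've installed it with pip install tiktoken:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding GPT-4 uses

print(enc.n_vocab)          # size of the universe of tokens: 100277
print(enc.encode("Hello"))  # a single token ID
print(enc.encode("HELLO"))  # splits differently -- tokens are case-sensitive
```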
So, Is Tokenization Just Mapping Words to Numbers?
In simple terms, yes! Tokenization maps words (or parts of words) to unique numeric IDs. For example, your prompt of "What is the tallest building in the world?" will become: 3923, 374, 279, 82717, 4857, 304, 279, 1917, 30 in GPT-4.
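Here's the same mapping done programmatically (again with tiktoken and the cl100k_base encoding); decode shows the mapping also works in reverse, turning IDs back into text:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "What is the tallest building in the world?"
ids = enc.encode(prompt)

print(ids)              # the sequence of token IDs sent to the model
print(enc.decode(ids))  # and back again: the original question
```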
Why Aren’t Tokens Full Words?
Efficiency is the main reason. A vocabulary containing every full word would be enormous (and would still miss rare or brand-new words), so engineers use a compression-inspired technique called Byte Pair Encoding (BPE) to break text into reusable sub-word pieces without losing meaning.
Ok, So How Does BPE Work?
At its core, BPE compresses text by merging frequently occurring pairs into new IDs. For example, the word "Hello" is stored as the bytes 48 65 6c 6c 6f in hexadecimal (01001000 01100101 01101100 01101100 01101111 in binary). BPE then combines frequent byte pairs, like "6c" ("l") and "6f" ("o"), into a single new ID, say 2699, reducing the length of the input. The result: "Hello" is now represented as 48 65 6c 2699, and the merging continues from there. Similarly, the "lo" in "Cello" is also represented by 2699. Simple but brilliant!
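If you like seeing the mechanics, here's a stripped-down Python sketch of one BPE merge step. The choice of 256 as the new ID is mine (the first ID beyond the 0-255 byte range); the 2699 above is just another arbitrary ID, and real tokenizers learn tens of thousands of such merges from huge text corpora.

```python
from collections import Counter

text = "Hello Cello"
ids = list(text.encode("utf-8"))  # raw UTF-8 bytes: 72 101 108 108 111 ...

# A BPE trainer repeatedly counts adjacent pairs and merges the most frequent.
pair_counts = Counter(zip(ids, ids[1:]))
print(pair_counts.most_common(3))  # (101,108)="el", (108,108)="ll", (108,111)="lo" each appear twice

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the single `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Merge the ("l", "o") pair from the example above into the new ID 256.
ids = merge(ids, (ord("l"), ord("o")), 256)
print(ids)  # both "Hello" and "Cello" now end with the merged token 256
```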
In summary: a model processes your input and predicts one token at a time based on probability, and tokens are numeric representations of words or parts of words.
It may seem technical, but I hope this breakdown helps. Let me know if you have any questions or if I missed something. We’ll dive deeper into terms like "parameters" and "training" in future posts, so stay tuned!
Until next time! Ciao!
P.S. Want to see how tokens are split and assigned IDs? Check out Tiktokenizer and experiment with different models. For OpenAI’s tokenizer, visit OpenAI Tokenizer Tool.