AI - Solution looking for a Problem to solve
I know... heretic, heathen, "what is he talking about?!". It's really my knee-jerk response to anything I can't understand with my simple technology background. In the past ten years we've had some major innovations, what with "cloud", and "crypto", and now "AI". Understanding these propositions at the 60,000 ft level is relatively easy, but I needed to understand the bare-metal detail in order to extrapolate the real proposition. Here's my journey.
I'm an avid software developer (C# all the way), and use Visual Studio as my Integrated Development Environment (IDE) for all my coding and debugging. For a few years now, autocompletion has been amazing: you type the first letter on a new line of code and, wham, it proposes an entire statement or code block based on the previous code and the next character you type. Most times I'm thinking "How did you guess that?!". I hit tab and off we go to the next chunk of code.
I've also tried some vibe coding, using prompts with LLMs like Anthropic's Claude. I would say I've had limited success with this: anything from modules that compile but don't function, modules that flat-out won't compile, and modules with fictitious methods and classes. My last vibe-coding experiment was to write some AI code. I spent two days trying to get 20 lines of code to work. It was doomed from the outset. It was at this point I made the most productive decision in my AI journey... just learn and do it old style.
What prompted me to look at Generative AI was a conversation with my friend Alwin Stephen, an entrepreneur, who explained that he works with an assistant off whom he bounces ideas and with whom he generally collaborates. His assistant turned out to be Generative AI. But Generative AI in vanilla mode is too much of a generalist, so the answer is to apply Retrieval-Augmented Generation (RAG) and give Gen AI some domain expertise. After a quick demonstration, I was hooked. It's just amazing. To expedite my journey I went the no-code route and used N8N to explore some of the features of RAG. I used this article to explore the ideas. This AI Starter Kit involved creating an AI workflow with an agent based on documents you upload. However, although I'd got it working, I had no idea how or why it did what it did. I let it drop for about a month.
Then my esteemed #marketdata SME, Vishal Shah, explained he was leading a team building a prototype that explored how internal intellectual property could be leveraged to benefit an organisation. This was RAG again! I wanted to help, but he was prototyping in Python, and because of my religion (Microsoft) and old-dog-new-tricks, my journey had to be conducted in C#. Time to code!
Where to start though? Yes indeed.
Large Language Models (LLMs) are built by many providers such as Anthropic, Google, OpenAI, DeepSeek and NVIDIA. They come in a variety of flavours and capabilities: there are small ones, large ones, fast ones, slow ones; multi-modal ones focused on images, text, video, etc.; reasoning models and non-reasoning models; locally deployed, hosted, and LLMs as a Service. Take a look at this LLM Leaderboard to get the full effect of choice.
One of the key datapoints for understanding an LLM's capability is its number of parameters. Imagine taking all your learning data (internet artefacts) and running it through a process that creates a neural network: each (hidden) layer of the network contains nodes, and the weightings attached to them, called parameters, are what steer the journey from the input (layer) to the output. Creating these neural networks can be hugely expensive and time-consuming, but broadly, the more parameters, the more capable the LLM.
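To make "parameter" a little more concrete, here's a toy sketch in C# of a single neuron: a weighted sum of its inputs, where the weights (and bias) are the parameters. It's purely illustrative; a real LLM has billions of these, arranged in layers with non-linear activation functions on top.

// Toy illustration only: one "neuron" computing a weighted sum of its inputs.
// The weights and bias are the "parameters"; a real model has billions of them
// and applies an activation function to the result.
double[] inputs  = { 0.2, 0.7, 0.1 };    // values arriving from the previous layer
double[] weights = { 0.9, -0.4, 1.3 };   // learned parameters
double bias = 0.05;                      // also a learned parameter

double output = bias;
for (int i = 0; i < inputs.Length; i++)
    output += inputs[i] * weights[i];

Console.WriteLine($"Neuron output: {output:F3}");   // passed on to the next layer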
Recent model costs include:
With the model(s) built, there's then the computational resource to run the model. If you host in your home lab you're going to need GPUs with suitable amounts of RAM; otherwise, consider a cloud provider. Somewhere along the way you'll have to make an investment to benefit from using an LLM. My choice is to run models locally with Ollama, on a 32-core AMD workstation with 64GB RAM and an NVIDIA RTX 3080 12GB GPU with 8,960 CUDA cores.
We still haven't written any code... although we do have an LLM, so that's one tick in the box. The next step, whatever your programming religion, is to find a framework of some sort. Frameworks/SDKs expedite the progress one can make building a solution. Invariably they introduce some level of abstraction that you have to feel comfortable with. I went through three iterations over the course of two weeks, each time finding a limitation that led me to the next. From my N8N days I knew I had to ingest documents that contained domain expertise, so I found some PDFs and Word documents. However, there are two routes to go with these documents in a RAG sense; but first, let's talk about how an LLM deals with a document.
Prompts and Tokens
There's the concept of a prompt: an input that generates an output. A prompt might be
create an image of a chatbot
The LLM starts off by converting your text into tokens. Each LLM has a different tokenisation strategy. You can think of tokens as words, so the above prompt has 6 tokens. Some tokenisation schemes, though, might convert "chatbot" into two tokens, "chat" and "bot", so now our prompt has 7 tokens. You could tokenise down to the character level, but that would be intense from a processing perspective.
Each token is assigned a value, and it's these values that are used as inputs to work through the neural network. In fact, the output from an LLM is also by way of values/tokens. You can imagine that a few tokens in are likely to generate many more out, and that some processing took place as the LLM traversed the neural network. You'll see from the previous table that any hosted LLM will charge based on input tokens and also on output tokens, with output tokens costing more to include the processing premium. Because LLMs require compute, their output is also constrained by the rate at which they can generate tokens, so you'll see that different services have different rates measured in tokens/second. As an example, GPT-5 (ChatGPT) can generate about 122 tokens/second.
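As a rough illustration of the idea (real tokenisers use byte-pair encoding over vocabularies of tens of thousands of entries; the vocabulary and token ids below are entirely made up), tokenisation looks something like this:

using System;
using System.Collections.Generic;

// Made-up vocabulary mapping sub-words to token ids; illustrative only.
var vocabulary = new Dictionary<string, int>
{
    ["create"] = 101, ["an"] = 12, ["image"] = 875,
    ["of"] = 9, ["a"] = 5, ["chat"] = 3021, ["bot"] = 774
};

string prompt = "create an image of a chatbot";

var tokenIds = new List<int>();
foreach (var word in prompt.Split(' '))
{
    if (vocabulary.TryGetValue(word, out int id))
        tokenIds.Add(id);
    else
    {
        // "chatbot" isn't in this vocabulary, so it's split into known sub-words
        tokenIds.Add(vocabulary["chat"]);
        tokenIds.Add(vocabulary["bot"]);
    }
}

Console.WriteLine($"{tokenIds.Count} tokens: {string.Join(", ", tokenIds)}");
// prints: 7 tokens: 101, 12, 875, 9, 5, 3021, 774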
Context
We can also use the prompt to preface our instruction/query with some context. For example
The chatbot always has a hat. Create an image of a chatbot
We've created some focus and context by stipulating that chatbots always have hats. In prompt engineering, the amount of text a model can take in at once is known as the context window. Different LLMs have different context window sizes: some have a context window as small as 8k, whereas others have a size of 10 million (!). What are the units... tokens, of course. This table, once more, tells us about context windows. Now imagine: with a context window of 10 million tokens, we could put the contents of several documents in front of our prompt, and the prompt could then query that context.
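In code, that amounts to string concatenation and nothing more. A minimal sketch, assuming a local Ollama instance on its default port with a model called "llama3" pulled locally (the file name and question are made up for illustration):

using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;

string context = File.ReadAllText("CompanyHandbook.txt");   // several documents' worth of text
string question = "What is our travel expenses policy?";
string prompt = $"{context}\n\n{question}";                 // context in front, query at the end

using var http = new HttpClient();
var request = new { model = "llama3", prompt, stream = false };
var response = await http.PostAsJsonAsync("http://localhost:11434/api/generate", request);
var json = await response.Content.ReadAsStringAsync();

// Ollama returns the generated text in the "response" field
Console.WriteLine(JsonDocument.Parse(json).RootElement.GetProperty("response").GetString());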
This is an example of RAG, in a fairly crude form. We'd have to pay for all the tokens in, the LLM would have all the work of tokenising the document text, and we could end up with a one-word answer! A more plausible way is to process the documents beforehand. The steps to accomplish this are:
1. Split (chunk) each document into smaller pieces of text.
2. Run each chunk through an embedding model, which turns it into a vector of numbers capturing its meaning.
3. Store each vector, together with its source chunk, in a vector database.
Time to code! We need to get documents into a vector database. I always look for frameworks to provide some level of abstraction and dependency injection, and, as mentioned, I went through a number of frameworks/SDKs over a two-week period.
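Stripped back to the essentials, and with no framework at all, the ingestion side might look something like the sketch below. Everything in it is an assumption of mine rather than any official API: a local Ollama instance on its default port, the nomic-embed-text embedding model, a made-up folder path, and a plain in-memory list standing in for a real vector database.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

var http = new HttpClient();

// Stand-in for a vector database: each entry is a chunk of text plus its embedding.
var store = new List<(string Chunk, float[] Embedding)>();

// Watch a folder; any file dropped into it gets ingested immediately.
// (Real code would wait for the file to finish copying and handle locked files.)
var watcher = new FileSystemWatcher(@"C:\RagDocuments") { EnableRaisingEvents = true };
watcher.Created += async (_, e) => await IngestAsync(e.FullPath);

Console.WriteLine("Watching for documents. Press Enter to stop.");
Console.ReadLine();

async Task IngestAsync(string path)
{
    string text = await File.ReadAllTextAsync(path);

    // Naive chunking: fixed-size character windows. Real pipelines split on
    // paragraphs or sentences and overlap the chunks.
    const int chunkSize = 500;
    for (int start = 0; start < text.Length; start += chunkSize)
    {
        string chunk = text.Substring(start, Math.Min(chunkSize, text.Length - start));
        store.Add((chunk, await EmbedAsync(chunk)));
    }
    Console.WriteLine($"Ingested {path} ({store.Count} chunks stored in total)");
}

async Task<float[]> EmbedAsync(string input)
{
    // Ollama's embeddings endpoint returns {"embedding": [ ... ]}
    var request = new { model = "nomic-embed-text", prompt = input };
    var response = await http.PostAsJsonAsync("http://localhost:11434/api/embeddings", request);
    var json = await response.Content.ReadAsStringAsync();
    return JsonDocument.Parse(json).RootElement
        .GetProperty("embedding").EnumerateArray()
        .Select(el => el.GetSingle()).ToArray();
}

In a real system the in-memory list would be replaced by an actual vector database, and PDFs and Word documents would need a text-extraction step before chunking.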
In C#, I constructed exactly that: code that watches a folder, and any file added to that folder immediately gets ingested, chunked up, embedded and stored. So, let's get back to the context window. If I asked an LLM
When did I work at Bloomberg?
It would have no clue, because none of its training data includes this (at least I'm hoping not!). With the earlier, crude method I could prefix my question with the contents of my CV:
2000-2003 worked at Organisation A, 2004-2010 worked at Company B, 2010-2022 worked at Bloomberg. When did I work at Bloomberg?
But this costs more (input tokens) and takes longer. Now that I have a vector database with my CV stored, what I can do instead is:
Get the code to take the "When did I work at Bloomberg?" prompt, create an embedding (vector) from it, and then search the database for the closest matches. What does "close" mean, though? During the comparison we use cosine similarity, a mathematical metric that measures how similar two vectors are by calculating the cosine of the angle between them. It finds vectors that are semantically related by comparing their directions, not their magnitudes. A score of 1 means the vectors point the same way, 0 means they are unrelated, and -1 means they are opposite, with everything in between.
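The calculation itself is only a few lines; this sketch assumes the two vectors have the same length, which embeddings produced by the same model always do:

using System;

// Cosine similarity: the dot product of the two vectors divided by the
// product of their magnitudes. Ranges from -1 (opposite) to 1 (same direction).
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot  += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}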
You then choose the top one, two, three (and so on) results, retrieve the text chunks associated with them, and place those chunks in the context window ahead of the query. In this over-simplified example we'll probably end up with just the Bloomberg chunk, but imagine that 100 lengthy documents of a company's domain knowledge were stored, and that you could easily query that dataset.
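Pulling the pieces together, a retrieval sketch reusing the store, EmbedAsync and CosineSimilarity placeholders from the earlier snippets (my own stand-ins, not any library's API) might look like this:

using System;
using System.Linq;

string question = "When did I work at Bloomberg?";
float[] questionEmbedding = await EmbedAsync(question);   // same embedding model as at ingestion

// Rank every stored chunk by similarity to the question and keep the top 3.
var topChunks = store
    .OrderByDescending(entry => CosineSimilarity(entry.Embedding, questionEmbedding))
    .Take(3)
    .Select(entry => entry.Chunk);

// Set the context window: retrieved chunks first, the question last,
// then send the assembled prompt to the LLM exactly as before.
string prompt = string.Join("\n\n", topChunks) + "\n\n" + question;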
That's Retrieval-Augmented Generation (RAG), and along the way we've covered prompts, LLMs, context windows, embeddings, cosine similarity, vector databases, chunking and tokens. I find this area of AI fascinating and would be curious to hear about any exciting areas you've seen it applied in. In the financial services industry I could imagine use cases for invoices, contracts, licensing, usage, data sets and inventory.
and finally
Saying please and thank you to the LLM is costing someone money.
According to OpenAI CEO Sam Altman, expressing gratitude to or showing consideration for ChatGPT has cost the company “tens of millions of dollars.” He shared this in response to a user who pondered: “I wonder how much money OpenAI has lost in electricity costs from people saying ‘please’ and ‘thank you’ to their models.”