LLMs and Big Data - Let's clear up some misconceptions
We see many companies advertising their new state-of-the-art genAI-capable systems. We are able to achieve wonders with generative AI, and create products that wow execs all around the world. What I am discussing here shouldn't be considered best practice; it is my own perspective based on what I have experienced in the market.
The reality is that most of the offerings out there will be from a known source behind the scenes. Not many companies have the resources to build and train their own model, and some just need a product out the door in record time. Here are a few problem statements:
The last one is a big pain point. Not only are LLMs some of the most expensive models to maintain and fine-tune, there are use cases where using these models is massive overkill, or plainly overpriced if you intend to use them as-is. I am not going to discuss the first three problems here, since each of those is a discussion on its own.
Let's look at sentiment analysis. A key factor is that it only has three potential outputs, e.g. Positive, Neutral or Negative. To get only that limited output from an LLM, you need to construct a prompt that constrains the response, otherwise you will get a full paragraph. Even with a constrained prompt, the model is still doing the work of a full text generator internally, which is costly. A fit-for-purpose model, by contrast, will give only the desired output, and needs substantially less computing power and a far smaller model size to return those three labels.
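As a rough illustration of the fit-for-purpose route, here is a minimal sketch. It assumes the Hugging Face transformers library and one public three-class sentiment checkpoint; swap in whatever model fits your stack.

# Minimal sketch: a small, fit-for-purpose sentiment classifier.
# Assumes the Hugging Face transformers library is installed; the model
# name is one public three-class checkpoint, not a recommendation.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

print(classifier("The rollout went smoother than expected."))
# e.g. [{'label': 'positive', 'score': 0.98}] - a label, not a paragraph

The model behind this is a few hundred megabytes and runs comfortably on a CPU, which is the whole point: three labels do not need a frontier model.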
The next big issue is how to apply AI to big data analysis. In the current state we can't, well, not directly. Let me explain. AI models have limited input and output capacity, measured in tokens. If you use a cloud-based AI, tokens are also the primary cost driver, and what most people don't take into account is that you pay for both input and output tokens. What exactly a token is deserves a wall of text of its own; just know that a token is not a single character, and in some cases not even a whole word. Gemini can handle context windows of up to 1 million tokens, but most models offer 8K to 128K as standard. As you can imagine, you can't ask a model to analyze millions of records and give you some sort of summary in one pass. We cheat the system by feeding it chunks, but the overall picture might be skewed, since context from one chunk to the next might not be coherent. Not to mention the cost of all those tokens if you do this routinely.
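To make the chunking workaround concrete, here is a minimal map-reduce style sketch. The llm_summarize() function is a hypothetical stand-in for whatever API you actually call; the point is the shape of the workaround and where it leaks context.

# Minimal map-reduce chunking sketch. llm_summarize() is a hypothetical
# stand-in for your actual LLM call; chunk size depends on your model's
# context window and your token budget.

def llm_summarize(text: str) -> str:
    raise NotImplementedError  # placeholder: call your LLM of choice

def chunked(records: list[str], size: int):
    for i in range(0, len(records), size):
        yield records[i:i + size]

def summarize_big_data(records: list[str], chunk_size: int = 500) -> str:
    # Map step: summarize each chunk independently. Each call is blind
    # to the other chunks, which is where the skewed overall picture
    # comes from.
    partials = [llm_summarize("\n".join(chunk))
                for chunk in chunked(records, chunk_size)]
    # Reduce step: summarize the summaries. You pay for input AND output
    # tokens at every step, so cost grows with the number of chunks.
    return llm_summarize("\n".join(partials))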
So what are my options?
That is a good question. Passing big data to an AI as raw input will not yield the required results, so there are a few options available:
Training your own model from scratch would give the most accurate results, but given the cost of training and the demands of real-time reporting, this will generally not be an option.
If your data is structured, e.g. in a database table, you could consider the following model types to do proper analysis of the data (a short sketch follows the list):
A) Regression Models:
Predict continuous outcomes using one or more predictor variables.
B) Classification Models:
Categorize data into predefined classes.
C) Clustering Models:
Group similar data points together.
D) Graphical Models:
Represent and reason about variable relationships.
E) Deep Learning Models:
Use architectures like RNNs and CNNs for complex data tasks, such as time series or image analysis.
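As a rough sketch of option B on structured data, here is what a classification model can look like with scikit-learn. The CSV path, feature columns, and target column are all hypothetical, made up for illustration.

# Minimal sketch: a classification model on structured (tabular) data.
# Assumes scikit-learn and pandas; file and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")          # hypothetical table
X = df[["age", "tenure_months", "spend"]]  # hypothetical feature columns
y = df["churned"]                          # hypothetical target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))

A model like this trains in seconds on millions of rows and costs nothing per query, which is exactly the trade-off against per-token LLM pricing.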
I have already over-invested in an LLM offering and everything is set up, so I am not willing to change. What are my options to analyze large chunks of data?
The good news is that if you have already invested heavily in LLMs instead of fit-for-purpose models, you can still use them for big data.
From the above it is clear that AI by itself is not a magic catch-all solution. It still needs some work to fit within your business. Sure, it can help with auditing and checking the content of documents, but even there the results need to be fact-checked. However, when it comes to analyzing large chunks of data, the model will require some assistance, both to save on cost and to extend beyond its capabilities.
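One way to provide that assistance, sketched under assumptions: pre-aggregate the data with conventional tooling (pandas here) and hand only the compact result to the LLM. The file name, columns, and the llm_summarize() helper are hypothetical.

# Sketch: pre-aggregate locally, send only the compact summary to the LLM.
# File and column names are hypothetical; llm_summarize() stands in for
# whatever LLM API you already pay for.
import pandas as pd

def llm_summarize(text: str) -> str:
    raise NotImplementedError  # placeholder for your actual LLM call

df = pd.read_csv("sales_2024.csv")  # hypothetical: millions of rows

# Conventional aggregation does the heavy lifting and costs no tokens.
summary = (
    df.groupby("region")
      .agg(total_revenue=("revenue", "sum"), orders=("order_id", "count"))
      .to_string()
)

# Only the few-line aggregate reaches the model, not the raw rows.
print(llm_summarize(
    "Write a short executive summary of these regional sales figures:\n"
    + summary
))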
It is important to map out what you plan to do with AI, and there is a distinct possibility that you will use different AI models for different purposes. There are no rules stipulating that you MUST use one model and one model only. Keep that in mind! For example, you might use Gemma 3 4B to output and format JSON objects, since you don't need it to write code, and use OpenAI for chats and document parsing.
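A minimal sketch of that idea: route each task type to a different model. The call_gemma() and call_openai() helpers are hypothetical stand-ins for your actual clients, and the task split is illustrative, not a recommendation.

# Minimal sketch: route tasks to different models by purpose.
# call_gemma() and call_openai() are hypothetical stand-ins.

def call_gemma(prompt: str) -> str:
    raise NotImplementedError  # e.g. a local Gemma 3 4B endpoint

def call_openai(prompt: str) -> str:
    raise NotImplementedError  # e.g. an OpenAI chat completion call

ROUTES = {
    "format_json": call_gemma,    # small, cheap model for mechanical work
    "chat": call_openai,          # larger model for conversation
    "parse_document": call_openai,
}

def run_task(task: str, prompt: str) -> str:
    return ROUTES[task](prompt)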
The bottom line is to plan your use cases before you start looking at the AI solutions out there. See how AI can benefit your business and what needs to be optimized. Most importantly, AI is not infallible: check the results, then check them again.