LLMs and Big Data - Let's clear some misconceptions
Thank you Grok for my image! The content is my own


We see many companies advertising their new state-of-the-art genAI-capable systems. We are able to achieve wonders with generative AI and create products that wow execs all around the world. What I discuss here shouldn't be considered best practice; it is my own perspective based on what I have experienced in the market.

The reality is that most of the offerings out there will be from a known source behind the scenes. Not many companies have the resources to build and train their own model, and some just need a product out of the door in record time. Here are a few problem statements:

  1. AI is one expensive hobby. If you want to self-host a moderately powerful large language model (LLM) with an eye on a public AI offering, you will need a lot of processing power and GPU capacity. Most startups just cannot afford this.
  2. What happens with your data? Some cloud providers do retain the queries, documents, etc. that you upload with your prompts.
  3. Almost all of these offerings are large language models. A big problem is that some companies get stuck on what they feel is the right model, even though it doesn't fit their business model.

The last one is a big pain point. Not only are LLMs some of the most expensive models to maintain and fine-tune, there are use cases where using these models is massive overkill, or just plainly overpriced if you intend to use them as-is. I am not going to discuss the first two problems since those are discussions by themselves.

Let's look at sentiment analysis. A key factor is that it only has three potential outputs, e.g. Positive, Neutral or Negative. To get only the required output from an LLM you need to construct a prompt that constrains the response, or else you will get a full paragraph. The key point is that the unwanted text is still generated internally in the first few passes, which is costly. A fit-for-purpose model will return only the desired label, and needs substantially less computing power and a much smaller footprint to produce those three outputs.
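To make the contrast concrete, here is a minimal sketch of a fit-for-purpose sentiment classifier using the Hugging Face transformers pipeline. This is illustrative only: the default pipeline model distinguishes positive and negative, so a three-class model would have to be swapped in for the Neutral label.

```python
# Minimal sketch: a small, fit-for-purpose sentiment classifier.
# Requires the "transformers" package; the default pipeline model is an
# assumption here and only returns POSITIVE/NEGATIVE labels.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

reviews = [
    "The delivery was quick and the product works great.",
    "Support never answered my ticket.",
]

for review in reviews:
    result = classifier(review)[0]  # e.g. {'label': 'POSITIVE', 'score': 0.99}
    print(review, "->", result["label"])
```

No prompt engineering is needed and only the label comes back, which is exactly the point of a fit-for-purpose model.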

The next big issue is how you apply AI to big data analysis. In the current state we can't do it directly. Let me explain. AIs have limited input and output capacity; we refer to these units as tokens. If you use a cloud-based AI this is also the primary cost driver, but what most people don't take into account is that you pay for both input and output tokens. The science behind what a token is would cover another wall of text; just know that a token is not a single character, nor in some cases is it a whole word. Gemini can handle a context window of up to 1 million tokens, but most models handle 8K to 128K as standard. As you can imagine, you can't ask a model to analyze millions of records and give you some sort of summary. We cheat the system by feeding it chunks, but the overall picture might be skewed since context from one chunk to the next might not be coherent. Not to mention the cost of those tokens if you do this routinely.
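As a rough illustration of the chunking workaround, here is a sketch that splits records into batches under a token budget. The four-characters-per-token estimate is a crude rule of thumb, not how any provider's tokenizer actually counts.

```python
# Rough sketch: split records into chunks that fit a model's context window.
# A real implementation would count tokens with the provider's tokenizer
# instead of the crude characters-per-token estimate used here.
def chunk_records(records, max_tokens=8000, chars_per_token=4):
    budget = max_tokens * chars_per_token
    chunks, current, used = [], [], 0
    for record in records:
        line = str(record)
        if current and used + len(line) > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(line)
        used += len(line)
    if current:
        chunks.append(current)
    return chunks

# Each chunk is summarized separately and the summaries are then combined,
# which is exactly where coherence between chunks can be lost.
```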


So what are my options?

That is a good question. Passing big data as raw input to an AI will not yield the required results, so there are a few options available:

Training your own model from scratch would give the most accurate results, but between the cost of training and the need for real-time reporting this will generally not be an option.

  • RAG: Think of Retrieval-Augmented Generation (RAG) as a cheat. You will need two models and a vector database as intermediaries. The LLM's purpose in this setup is purely to translate the results into human-readable content, while a smaller embedding model vectorizes the data into its numeric representation. This is useful for queries across all your data, but there are limitations. With large datasets the intermediary might still be prompted to only bring back the top n results. You do have the benefit of searching a full dataset, though. The caveat is that all your data has to be vectorized into a vector database. This is not a traditional row-based database and warrants a write-up on its own. To put it in simplistic terms, the data is converted into numbers (vectors), allowing you to search for nearest neighbours. This is a good method for knowledge articles and similar content. A minimal sketch of this flow follows the model breakdown below.
  • Use traditional ML models (fit for purpose): So what do I mean in this case? Let us break these down in a bit more detail:

A) Regression Models: 
Predict continuous outcomes using one or more predictor variables.

B) Classification Models: 
Categorize data into predefined classes.

C) Clustering Models: 
Group similar data points together.

D) Graphical Models: 
Represent and reason about variable relationships.

E) Deep Learning Models: 
Use architectures like RNNs and CNNs for complex data tasks, such as time series or image analysis.        
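Going back to the RAG option above, here is a minimal sketch of that flow under a few assumptions: the sentence-transformers package stands in for the smaller embedding model, plain numpy stands in for a proper vector database, and the final LLM call is left as a comment.

```python
# Minimal RAG sketch: embed documents, find the nearest neighbours to a
# query, and hand only those snippets to the LLM.
# The embedding model name is an assumption for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to reset your password in the customer portal.",
    "Refund policy for damaged goods.",
    "Steps to export monthly sales reports.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

query = "customer wants money back for a broken item"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]

# Cosine similarity (vectors are normalized, so a dot product is enough).
scores = doc_vectors @ query_vector
top_n = np.argsort(scores)[::-1][:2]

context = "\n".join(documents[i] for i in top_n)
# 'context' is placed into the LLM prompt, so the model only reads the
# handful of relevant snippets rather than the whole dataset.
print(context)
```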

If your data is structured, e.g. in a database table, you could consider the following model types to do proper analysis of the data:

  • Gradient Boosting Models (XGBoost, LightGBM): These are the industry standard for high performance on tabular data. They are highly accurate, efficient with large datasets, and excellent for predictive tasks like determining which factors most influence a particular outcome (see the sketch after this list).
  • Random Forests: This is another powerful model that works well for classification and regression. It's also useful for identifying which columns (features) in your table are the most important.
  • Linear/Logistic Regression: For simpler tasks where you need a clear, interpretable understanding of the relationships between variables, these models are fast and effective.
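As a rough sketch of the tabular route, here is gradient boosting with XGBoost on a structured table. The file name, column names and the binary `churned` target are purely illustrative assumptions.

```python
# Sketch: gradient boosting on structured (tabular) data with XGBoost.
# The CSV file, column names and 'churned' target are illustrative only.
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("customers.csv")                 # hypothetical table
X = pd.get_dummies(df.drop(columns=["churned"]))  # encode categorical columns
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))

# Which columns drive the prediction - the "factors that most influence
# a particular outcome" mentioned above.
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head())
```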


I have already invested heavily in an LLM offering and it is all set up, so I am not willing to change. What are my options to analyze large chunks of data?

The good news is that if you have already invested heavily in LLMs instead of fit-for-purpose models, you still have the ability to utilize them for big data.

  • Code generation: Even though the LLM might not have enough tokens to access the data in its entirety, it can still analyze the table structures and come up with code and SQL statements to extract the desired result (see the sketch after this list).
  • MCP (Model Context Protocol): By creating MCP server functionality, these services can extend the capabilities of the LLM. They can potentially hook into large datasets on behalf of the calling AI, rationalize the data, and pass the result back to the AI as a structured object to process. This takes a lot of the processing away from the AI, along with the tokens required, and by extension reduces the overall cost. It does mean you need to plan and invest in what you need from the server.
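To illustrate the code generation route, here is a sketch that sends only the table schema, not the data itself, to an LLM and asks for a SQL statement. The OpenAI Python client is used as an example provider; the model name and schema are assumptions.

```python
# Sketch: let the LLM see only the schema and write the SQL, so the
# millions of rows never pass through the model as tokens.
# Uses the OpenAI Python client as an example; the model name is an assumption.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

schema = """
CREATE TABLE sales (
    id INT, region VARCHAR(50), amount DECIMAL(10,2), sold_at DATE
);
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Return only a single SQL query, no prose."},
        {"role": "user", "content": f"Schema:\n{schema}\nTotal sales per region for 2024."},
    ],
)

sql = response.choices[0].message.content
print(sql)  # run this against the database yourself; only the small result
            # set (or a summary of it) goes back to the LLM if needed.
```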

From the above it is clear that AIs by themselves aren't a magic catch-all solution. They still need some work to fit within your business. Sure, they can help with auditing and checking the content of documents, but even there the output needs to be fact-checked. When it comes to analyzing large chunks of data, however, they will need some assistance to keep costs down and extend beyond the model's capabilities.

It is important to map out what you plan to do with AI, and there is a distinct possibility that you will use different AI models for different purposes. There is no rule that stipulates you MUST use one model and one model only. Keep that in mind! For example, you might use Gemma3 4B to output and format JSON objects since you don't need it to write code, and use OpenAI for chats and document parsing.

The bottom line is to plan your use cases before starting to look at the AI solutions out there. See how they can benefit your business and what needs to be optimized. More importantly, AI is not infallible: check the results, and then check them again.
