Knowledge vs. Data
I have been involved in several discussions recently regarding knowledge vs. data. Many researchers use the two terms in different scenarios without clearly differentiating them. To me, there are some fundamental differences between knowledge and data.
Data are raw, unprocessed information, such as numbers, symbols, observations, or facts collected without inherent interpretation. The clinical records in a patient’s EHR, for example, are data.
Knowledge is processed information that has been interpreted, vetted, and understood in a way that supports explanation, prediction, or action. The fact that metformin can treat diabetes, for example, is knowledge.
A graph is a popular structure for representing both data and knowledge. The two basic elements that compose a graph are nodes and edges.
In a data graph, nodes are typically samples, and edges usually encode relationships between samples (such as similarities). The nodes are usually characterized by feature vectors, and the edges can be weighted or unweighted. Inference on a data graph is typically transformed into mathematical operations on its adjacency matrix.
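As a concrete sketch of this, the snippet below builds a small similarity-weighted data graph from toy feature vectors and performs one step of label propagation through the adjacency matrix. The features, similarity kernel, and labels are all illustrative assumptions, not from any real data set.

```python
# Sketch: inference on a data graph via its adjacency matrix.
# Features, edge weights, and labels below are toy values for illustration.
import numpy as np

# Four samples, each characterized by a feature vector
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.1, 0.9]])

# Weighted adjacency matrix: Gaussian similarity between feature vectors
dists = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
A = np.exp(-dists**2)
np.fill_diagonal(A, 0.0)  # no self-loops

# One round of label propagation: samples 0 and 2 have known labels
# (+1 and -1); the others (0.0) are inferred from their neighbors.
labels = np.array([1.0, 0.0, -1.0, 0.0])
P = A / A.sum(axis=1, keepdims=True)  # row-normalize the adjacency matrix
propagated = P @ labels
```

Here the unlabeled samples inherit the label of whichever neighborhood they sit closest to, which is the basic pattern behind many graph-based semi-supervised methods.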
In a knowledge graph, nodes are semantic concepts that can carry a set of semantic attributes, and edges are semantic relationships that are also associated with semantic attributes. Inference on a knowledge graph typically involves embedding both nodes and edges as vectors and then performing mathematical operations on those vectors.
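A minimal sketch of this embedding-based inference, using a TransE-style score (the plausibility of a triple (h, r, t) is measured by how small ||h + r − t|| is). The entity and relation vectors below are hand-picked toy values; in practice they would be learned from a large triple collection.

```python
# Sketch: knowledge-graph inference via embeddings (TransE-style scoring).
# Embeddings are toy, hand-picked values for illustration only.
import numpy as np

entity = {
    "metformin": np.array([0.0, 1.0]),
    "diabetes":  np.array([1.0, 1.0]),
    "aspirin":   np.array([0.0, 0.0]),
}
relation = {"treats": np.array([1.0, 0.0])}

def score(h, r, t):
    """TransE score: smaller ||h + r - t|| means the triple is more plausible."""
    return np.linalg.norm(entity[h] + relation[r] - entity[t])

# (metformin, treats, diabetes) should score better (lower) than
# (aspirin, treats, diabetes) under these toy embeddings.
s_plausible   = score("metformin", "treats", "diabetes")
s_implausible = score("aspirin", "treats", "diabetes")
```

The point is that once concepts and relationships are vectors, ranking candidate facts reduces to simple vector arithmetic.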
Data graphs are data set dependent: for different data sets, the constructed data graphs are distinct from each other. Knowledge graphs should be data set independent: the knowledge they encode should be universally applicable to any data set.
In recent years, with the popularity of LLMs and foundation models, the boundary between knowledge and data has become more and more blurred. LLMs were mostly trained on text corpora, many of which are exactly the same sources used for curating knowledge, and these models can interpret and reason over information effectively. This has led people to think that LLMs can act as knowledge bases and that classical knowledge graphs are no longer needed. However, LLMs are black boxes, and it is challenging to tease out exactly what knowledge they have encoded. One popular approach is “post-hoc” analysis, i.e., prompting LLMs and checking their responses. However, there is no guarantee that the information generated by LLMs is factual: these models can “hallucinate,” and the information they generate does not always qualify as knowledge unless it is rigorously vetted and proved to be true. In this sense, it is more appropriate to say that LLMs generate “hypotheses” rather than “knowledge” – only validated hypotheses can be called knowledge.
Clearly, curating knowledge is not an easy task, as it requires intensive review and validation by domain experts. The question is: do we still need curated knowledge? My answer is yes. For example, curated knowledge can help “ground” LLM responses to be more factual (e.g., through the RAG process). This is particularly valuable given that we now have more and more LLM-generated information whose truthfulness is frequently difficult to judge. In addition, knowledge is heterogeneous and can thus help bridge distinct data modalities.

From this perspective, the graph representation is convenient, as it structures both the concepts and their relationships within the knowledge. With semantic web technologies, knowledge graphs are also efficient to query (e.g., through SPARQL) and convenient to visualize (e.g., through Neo4j). However, is a graph the optimal way of representing knowledge? Not necessarily: triples have many limitations when representing intricate knowledge (such as facts that hold only under certain conditions). Free text is still the most flexible way of representing knowledge, but its scale is too large and the textual renderings of the same piece of knowledge are too diverse, so we need some level of knowledge abstraction. Both knowledge graphs and LLMs are ways of abstracting knowledge: one is explicit and transparent, the other implicit and opaque. Maybe some hybrid form marrying the two would be the way to go.
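To make the grounding idea concrete, here is a minimal RAG-style sketch: retrieve matching triples from a tiny knowledge graph and prepend them to the prompt sent to an LLM. The triple store, the naive keyword retrieval rule, and the prompt format are all illustrative assumptions, not a real system.

```python
# Sketch: grounding an LLM prompt with knowledge-graph triples (a minimal
# RAG step). The triples, retrieval rule, and prompt format are illustrative.

# A tiny knowledge graph as (head, relation, tail) triples
triples = [
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "may_cause", "gastrointestinal upset"),
    ("insulin", "treats", "type 1 diabetes"),
]

def retrieve(question, kg):
    """Naive retrieval: keep triples whose head or tail appears in the question."""
    q = question.lower()
    return [t for t in kg if t[0] in q or t[2] in q]

def grounded_prompt(question, kg):
    """Prepend retrieved facts so the LLM is asked to answer from them only."""
    facts = retrieve(question, kg)
    context = "\n".join(f"- {h} {r} {t}" for h, r, t in facts)
    return (f"Known facts:\n{context}\n\n"
            f"Question: {question}\n"
            f"Answer using only the facts above.")

prompt = grounded_prompt("What does metformin treat?", triples)
```

In a real pipeline the retrieval step would be a SPARQL or Cypher query (or an embedding search) against a full knowledge base, but the shape of the grounding step is the same.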
In summary, data is data, and knowledge is knowledge. Knowledge needs to be seriously vetted and factual, which is critical given the growing amount of LLM-generated information, and it can improve the quality and interpretation of data-driven inference.
Fei Wang hope all is well! Thanks for the write up! agreed that the boundary between data & knowledge is becoming less by LLMs! if we get meaningful knowledge (ie curated by experts) out of data and use textual representation for downstream LLMs, do we even need traditional knowledge graphs (ie nodes & edges) for inference? especially if textual representation is medically meaningful?
Fei — this is an excellent breakdown. One frame I’ve been working with is that we’re entering a third layer beyond data and knowledge: coherence. Data is raw. Knowledge is curated. But coherence is behavioral — it’s how well information, context, and intent stay aligned as systems reason in real time. LLMs today can retrieve facts, link concepts, and even generate hypotheses. What they struggle with is maintaining continuity:
• stable intent across long reasoning chains
• consistent semantic grounding across modalities
• predictable behavior under shifting context
• low-drift interpretations when data and knowledge conflict
Knowledge graphs improve grounding. LLMs improve abstraction. But neither solves continuity under autonomy. As models, agents, and environments become more dynamic, I think the real frontier is ensuring coherence between data, knowledge, and behavior — the layer that keeps systems from drifting even when their inputs are correct. Knowledge answers “what is true.” Coherence answers “does the system stay true to it.” That’s where the next decade gets very interesting.
Agreed. Grounding, verifying, and explaining, follow-up reconciliation when explaining is still not enough…it’s needed if high-valued/high-stake questions still need to be asked
Thank you, Fei Wang. At Starbucks, we’re leveraging a knowledge graph to inject and ground menu data into our LLM for more accurate and contextual responses.
Knowledge in the data is the new needle in the haystack?