Knowledge Representation and Graphs
Knowledge representation is a key foundation for AI, intimately intertwined with data structures. Retrieval-Augmented Generation (RAG) provides one window into this. RAG augments a Large Language Model (LLM) by giving it a searchable database that provides additional context beyond what the LLM was trained on.
Consider an LLM trained at a particular time, for example 3 months ago. This LLM will have no knowledge of events after its training date and will do poorly on questions about current events. Providing a RAG database of news articles from the last 3 months lets the LLM retrieve context for questions about recent events. A common corporate use is to augment an LLM with non-public company data the LLM has not seen: proprietary code, SOPs, legal documents, etc.
The data in these examples is mostly text, though images and video are often relevant RAG data. Traditional relational databases are not well suited to these types of "unstructured" data. Most RAG systems instead use a vector database, which represents text, images, and video as vectors: lists of numbers.
Dense and Sparse Vectors
Vector databases employ two complementary approaches to represent and search content.
Dense vectors (also called semantic or embedding vectors) capture meaning. A neural network encoder transforms text into a high-dimensional vector (typically 384–1536 dimensions) where semantically similar content clusters together in vector space. Searching with dense vectors finds conceptually related content even when exact words differ—a query about "heart attack" retrieves documents discussing "myocardial infarction" or "cardiac arrest."
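As a rough illustration, here is a minimal sketch of dense-vector similarity using numpy. The tiny 4-dimensional vectors and their values are made-up stand-ins for the output of a real embedding model, which would produce hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1.0 means same direction, near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings"; a real encoder would produce 384-1536 dimensions.
embeddings = {
    "heart attack":          np.array([0.9, 0.1, 0.0, 0.2]),
    "myocardial infarction": np.array([0.8, 0.2, 0.1, 0.3]),
    "stock market rally":    np.array([0.0, 0.9, 0.8, 0.1]),
}

query = embeddings["heart attack"]
for text, vector in embeddings.items():
    print(f"{text:25s} {cosine_similarity(query, vector):.2f}")
# "myocardial infarction" scores close to 1.0 despite sharing no words with
# the query; "stock market rally" scores near 0.
```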
Sparse vectors capture keywords. These are high-dimensional but mostly zeros, with non-zero values only for terms present in the text. Traditional methods like TF-IDF and BM25 weight terms by frequency and distinctiveness. Sparse search excels at precise term matching—finding exact product names, codes, or technical terminology that dense vectors might blur together.
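Here is a from-scratch sketch of sparse, TF-IDF-style vectors; the documents and the weighting formula are toy examples, and production systems typically rely on BM25 as implemented by a search engine rather than this simplified version.

```python
import math
from collections import Counter

docs = {
    "d1": "aspirin treats headache and mild pain",
    "d2": "ibuprofen treats headache and inflammation",
    "d3": "aspirin dosage for cardiac patients",
}

def tf_idf(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Sparse vectors: each document maps only the terms it contains to a weight."""
    n = len(docs)
    tokenized = {d: text.split() for d, text in docs.items()}
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for d, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[d] = {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
    return vectors

vectors = tf_idf(docs)
# Keyword query: score is the sum of weights for query terms present in the doc.
query_terms = {"aspirin", "headache"}
for d, vec in vectors.items():
    print(d, round(sum(vec.get(t, 0.0) for t in query_terms), 3))
# d1 scores highest because it contains both exact terms.
```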
Hybrid search combines both: dense vectors catch semantic relationships while sparse vectors ensure keyword precision. Modern RAG systems typically score and merge results from both approaches, leveraging their complementary strengths.
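One common way to merge the two result lists is reciprocal rank fusion; the sketch below uses made-up ranked lists and the conventional constant k = 60, and is not tied to any particular vector database.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Merge ranked result lists: each doc scores 1/(k + rank), summed across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

dense_results  = ["d2", "d1", "d4"]   # semantic nearest neighbors
sparse_results = ["d1", "d3", "d2"]   # best keyword matches
print(reciprocal_rank_fusion([dense_results, sparse_results]))
# d1 and d2 rise to the top because both searches rank them highly.
```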
From Vectors to Graphs: RDF and Knowledge Representation
RAG using vector databases works well for representing text as data. However, language and the "knowledge" represented by that language have structure beyond numerical vector representations. GraphRAG uses a knowledge graph data structure as a RAG source.
Resource Description Framework (RDF), adopted as a W3C recommendation in 1999, and related standards like RDFS and OWL were developed well before modern LLMs, yet they provide powerful knowledge representations that extend the capabilities of LLMs through graph databases. RDF and related standards attempt to standardize the ubiquitous, simple, but powerful subject-predicate-object structure of human thought and language.
RDF, RDFS, and OWL: Building Knowledge Graphs
RDF (Resource Description Framework) represents knowledge as triples: subject-predicate-object statements. Each triple is an atomic fact: "Aspirin treats Headache," "Aspirin is_a Drug," "Headache affects Head." Subjects and objects are nodes; predicates are edges. URIs uniquely identify resources, enabling graphs to link across datasets—the foundation of the semantic web.
RDFS (RDF Schema) adds vocabulary for defining classes and hierarchies. You can declare that "Drug" is a class, "Aspirin" is an instance of Drug, and "Analgesic" is a subclass of Drug. RDFS enables inference: if Aspirin is an Analgesic and Analgesics are Drugs, reasoners conclude Aspirin is a Drug.
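As a concrete sketch, the triples and subclass hierarchy above can be expressed with the rdflib Python library. The http://example.org/ namespace and its terms are placeholders, and the SPARQL property path stands in for a full RDFS reasoner.

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# RDFS schema: classes and a subclass hierarchy.
g.add((EX.Drug, RDF.type, RDFS.Class))
g.add((EX.Analgesic, RDF.type, RDFS.Class))
g.add((EX.Analgesic, RDFS.subClassOf, EX.Drug))

# RDF instance data: subject-predicate-object triples.
g.add((EX.Aspirin, RDF.type, EX.Analgesic))
g.add((EX.Aspirin, EX.treats, EX.Headache))
g.add((EX.Headache, EX.affects, EX.Head))

# A SPARQL property path follows subClassOf chains, so Aspirin is found
# as a Drug even though that triple was never stated explicitly.
results = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?thing WHERE {
        ?thing a/rdfs:subClassOf* <http://example.org/Drug> .
    }
""")
for row in results:
    print(row.thing)   # -> http://example.org/Aspirin
```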
OWL (Web Ontology Language) provides richer expressiveness for complex domains. OWL can specify cardinality constraints (a person has exactly one biological mother), property characteristics (if A is married to B, then B is married to A), and complex class definitions (a "Parent" is a Person who has at least one child). OWL ontologies enable sophisticated automated reasoning.
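For illustration, here is roughly what such OWL axioms look like in Turtle, loaded with rdflib. The example.org terms are hypothetical, and actually drawing the inferences would require an OWL reasoner, which is not shown here.

```python
from rdflib import Graph

owl_ttl = """
@prefix ex:   <http://example.org/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Property characteristic: if A is married to B, then B is married to A.
ex:marriedTo a owl:SymmetricProperty .

# Cardinality constraint: a Person has exactly one biological mother.
ex:Person a owl:Class ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty ex:hasBiologicalMother ;
        owl:cardinality "1"^^xsd:nonNegativeInteger
    ] .

# Complex class definition: a Parent is a Person with at least one child.
ex:Parent owl:equivalentClass [
    a owl:Class ;
    owl:intersectionOf (
        ex:Person
        [ a owl:Restriction ;
          owl:onProperty ex:hasChild ;
          owl:minCardinality "1"^^xsd:nonNegativeInteger ]
    )
] .
"""

g = Graph()
g.parse(data=owl_ttl, format="turtle")
print(len(g), "triples loaded")   # the axioms themselves are stored as ordinary triples
```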
Together, these standards enable definition of a knowledge graph schema: the vocabulary of entity types (classes), relationship types, and constraints governing valid graph structure. The schema provides the blueprint; instance data populates actual nodes and edges conforming to that blueprint.
Hybrid Search: Combining Vectors and Graphs
RAG systems can represent the same text as dense vectors, sparse vectors, and a knowledge graph, enabling hybrid search that combines and synergizes the best of each. Vector search finds semantically relevant passages; graph traversal surfaces structured relationships and enables multi-hop reasoning that pure vector similarity cannot capture.
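A highly simplified sketch of the idea follows, with an in-memory dictionary standing in for a graph database and a stubbed vector_search standing in for dense retrieval; both are illustrative assumptions rather than any vendor's API.

```python
# Hypothetical in-memory stand-ins for a vector database and a graph database.
knowledge_graph = {
    "Aspirin":   [("treats", "Headache"), ("is_a", "Analgesic")],
    "Analgesic": [("is_a", "Drug")],
    "Headache":  [("affects", "Head")],
}

def vector_search(query: str, top_k: int = 2) -> list[str]:
    """Stand-in for dense retrieval: returns entity IDs most similar to the query."""
    return ["Aspirin", "Headache"][:top_k]

def expand(entities: list[str], hops: int = 2) -> set[tuple[str, str, str]]:
    """Multi-hop graph traversal starting from the seed entities found by vector search."""
    facts, frontier = set(), set(entities)
    for _ in range(hops):
        next_frontier = set()
        for subject in frontier:
            for predicate, obj in knowledge_graph.get(subject, []):
                facts.add((subject, predicate, obj))
                next_frontier.add(obj)
        frontier = next_frontier
    return facts

seeds = vector_search("what class of drug helps with headaches?")
for triple in sorted(expand(seeds)):
    print(triple)
# Two hops surface ("Analgesic", "is_a", "Drug"), a fact no single retrieved
# passage needs to state explicitly; graph traversal recovers it.
```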
The Schema Challenge
A graph schema is critical for a useful knowledge graph. A schema defines the types of entities (things) the graph can contain, the relationships among them, and the properties of those entities and relationships. For well-defined, focused applications, schemas can be manually defined; a patient notes schema is sketched below.
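This is a hand-written sketch of what such a patient notes schema might contain; the entity types, relationship types, and properties are illustrative choices, not a clinical standard.

```python
# A manually defined schema for a patient-notes knowledge graph (illustrative only).
patient_notes_schema = {
    "entity_types": {
        "Patient":    ["name", "date_of_birth", "mrn"],
        "Clinician":  ["name", "specialty"],
        "Condition":  ["name", "icd10_code"],
        "Medication": ["name", "dosage"],
        "Encounter":  ["date", "location"],
    },
    "relationship_types": [
        ("Patient",    "DIAGNOSED_WITH", "Condition"),
        ("Patient",    "PRESCRIBED",     "Medication"),
        ("Clinician",  "TREATED",        "Patient"),
        ("Encounter",  "INVOLVES",       "Patient"),
        ("Medication", "TREATS",         "Condition"),
    ],
}
```

An extraction pipeline would consult such a schema to decide which entities and relationships to pull out of free-text notes.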
For more comprehensive applications, like representing the complete knowledge graph of a real, evolving business, creating and maintaining a realistic knowledge graph schema can be challenging and time-consuming. Human language is messy, with subtleties like synonyms (different words referring to the same entity) and disambiguation challenges (the same word used for different entities). As a business evolves, new entities and relationships continually arise while existing ones change.
Automating creation and maintenance of an accurate, complete knowledge graph schema as the world changes is a more challenging AI task than resolving text and other resources according to a given, fixed knowledge graph schema. It's somewhat of a chicken-and-egg situation. A schema is needed to guide what types of entities and relationships should be extracted from documents to construct the knowledge graph. But appropriately creating this schema requires a deep understanding of the knowledge the graph should represent.
In practice, both for humans and AI, creating the schema and populating it with instances proceed as parallel discovery. AI knowledge graph construction systems that include schema discovery exist, for example AutoSchemaKG. In addition to GraphRAG, knowledge graphs enhance and guide agentic AI systems, in particular by providing powerful intelligent memory. AI construction and use of knowledge graphs is both an active research area and a growing vendor ecosystem, including Neo4j, Stardog, and Linkurious.
Here's an example of a knowledge graph schema and graph generated by Neo4j's graph builder from a 35-page overview produced by ChatGPT deep research.