Evolution of Knowledge Graphs

Natural evolution proceeds in fits and starts, sometimes resulting in progress, sometimes not. So does AI research.

In a recent article (August 2023), Luna Dong – one of the most visible and successful people building industrial-strength knowledge graphs at Meta, Amazon, and Google – offers an insightful characterization of the evolution of successive generations of knowledge graphs. Very few people have this level of expertise in knowledge graphs at web scale, so when she stops to consider what works and what doesn't, we need to listen carefully. Her view is also valuable because it captures how AI engineers as a group generally think about knowledge graphs, and it describes quite accurately how methods for building them have evolved.

From an engineering perspective, knowledge graphs have evolved from entity-based knowledge graphs with clear semantics to "text-rich" graphs with more flexible but more ambiguous free text as entities to "dual neural" graphs that attempt to sidestep the explicit representation of semantic structure and rely instead on implicit relations as represented in embeddings. 

Let's unpack this description to understand these evolutionary steps in more detail.

Entity-based knowledge graphs 

Entity-based knowledge graphs are based on the "seed crazy idea" that we can get computers to model the world as we do: in terms of entities with attributes and the relations between them. The nodes of the graph, then, are mostly named entity instances corresponding to distinct real-world individuals, grouped under the hand-crafted categories of entities (other nodes) in an ontology.

The semantics of entity-based knowledge graphs are transparently defined in terms of the mappings from node or relation labels (strings) to real-world individuals, attributes, and categories. When the mappings between labels and real-world entities [i.e., between strings and things] are explicit and reliable, then the strings' semantics are clear. Q39729 has no semantics at all until we systematically associate it with the real-world individual named Jack Nicholson. For centuries, these mappings have been at the core of our understanding of what meaning and semantics are. 

Entity-based knowledge graphs make explicit the mappings – stored or computed – between strings and things that constitute a machine-accessible semantics available to algorithms, so they are crucial for getting computers to model the world as we do.
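To make this concrete, here is a minimal sketch of the idea in Python. It uses Wikidata-style identifiers; the labels, category ID, and relation name are illustrative assumptions, not a real graph store:

```python
# Toy entity-based knowledge graph (illustrative only).
# Opaque IDs like "Q39729" have no semantics by themselves; they
# acquire meaning through explicit mappings to labels and, ultimately,
# to the real-world individuals and categories they denote.

# Node table: opaque ID -> human-readable label (a "string to thing" proxy)
labels = {
    "Q39729": "Jack Nicholson",   # a real-world individual
    "Q33999": "actor",            # a hand-crafted category (ID assumed)
}

# Edge list: (subject ID, relation, object ID) triples
triples = [
    ("Q39729", "instance_of", "Q33999"),
]

def describe(subject_id: str) -> list[str]:
    """Render each triple about subject_id in human-readable form."""
    return [
        f"{labels[s]} --{rel}--> {labels[o]}"
        for (s, rel, o) in triples
        if s == subject_id
    ]

print(describe("Q39729"))
```

The point of the separation is that algorithms can traverse the opaque IDs while the label table keeps the mapping to things explicit and inspectable.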

The key engineering challenge in building entity-based knowledge graphs is linguistic variability: categories, relations, and instances are expressed with very diverse strings across different sources and languages. This hinders data integration and makes the mapping from strings to things that much harder. Much progress has been made in developing tools and systems for establishing these mappings at scale, but more is needed in deploying them to reduce the dependence on human experts – a key blocker for scalability.

Text-rich graphs 

Text-rich graphs are based on the "seed crazy idea" that we can mine and store semantic structure from unstructured or semi-structured source data alone, i.e., from corpora of strings. For humans, strings are informative surrogates for concepts, so perhaps machines can treat them as such, too.

Text-rich graphs evolved to address the problem of modeling domains where pre-existing resources with semantic structure are sparse and ambiguities abound, with vague and fluid semantic boundaries between values and even classes – i.e., most of the real world. The other motivation was scale: many domains are both economically important and so vast that depending on slow, high-expertise human workers to provide explicit semantics is not seen as feasible.

Examples given are domains like Products, Bioinformatics, Health, Law, and Events. Engineers argue that we cannot clearly model these domains with entities and relationships because there are millions of types and attributes and many of them overlap – not to speak of the massive variability of the strings we work with. Essentially, engineers are confessing to an over-dependence on the existing human-created knowledge resources that work so well when available. But all too often, engineers see the linguists and ontologists who created those essential knowledge resources as blockers rather than assets.  

Interestingly, there are ontologies and entity-based knowledge graphs for Products, Bioinformatics, Health, Law, and Events – lots of them! The past few years have seen huge investments by financial institutions, legal firms, industry associations, and healthcare organizations to create them at scale. I've built some of them myself – for products and services as well as parts of healthcare. That fact undermines a key motivation for text-rich graphs – we simply don't need text-rich graphs as substitutes for entity-based knowledge graphs.

But the more fundamental reason why text-rich graphs are maladaptive is that they have no semantics – no mappings between strings and things. The efforts to build text-rich graphs shifted focus to identifying and modeling graph-structured relations between strings and only strings: nodes and attributes are mostly (uninterpreted) free text. Early transformer approaches and initial language models recognized no distinction between strings and things, and made no effort to map one to the other.

So evolving from entity-based knowledge graphs to text-rich graphs seems very clearly to be regressive evolution. In this stage, researchers seem to have skipped machine-accessible semantics entirely: the vast majority of features in these models are not real-world individuals, attributes, or categories but other co-occurring strings, usually compressed into the values of embeddings. This is why I hesitate to call their results knowledge graphs. But developing text-rich graphs was not wasted effort; it was simply an incomplete solution for building graphs of conceptual knowledge.

"Dual-neural" Graphs

The "seed crazy idea" behind the next generation of graphs was that in some cases it may not be necessary to explicitly model knowledge or semantic structure at all. Instead we might be able to capture semantics implicitly through things like embeddings in LLMs. In an initial version of this scenario, some knowledge would be encoded in knowledge graphs, some in LLMs, and some in both. 

Things get confusing here. We enter strings into an LLM and get strings back in response, and we understand those responses as relevant and meaningful. But this is because we base our assessments on our own human-accessible semantics: we understand what the strings mean, or how they map to our concepts of things in the world.

But many people jump to the conclusion that the machines are doing the same thing – understanding – just as we would conclude if we were talking with other human interlocutors. Text-rich graphs and early LLMs exploit the many correlations between strings to mimic understanding but do not have machine-accessible semantics. These models include no representation of concepts or of conceptual structure or of how patterns of strings (e.g., embeddings) map to these structures – they assume that it is not necessary to explicitly model knowledge at all. So clearly there was no understanding because there was no model of how strings map to our concepts or things in the world. 

Evolving from entity-based knowledge graphs to text-rich graphs and dual-neural graphs seems very clearly to be maladaptive – a continuation of regressive evolution. Mixing and merging text-rich graphs and text-only LLMs in different ways does not contribute to the survival and proliferation of machine-accessible semantics.  

Next Evolutionary Steps for Knowledge Graphs

Things are evolving dramatically with recent research. Entity-based knowledge graphs are being used to tame unruly LLMs and LLMs are becoming instrumental in creating and expanding knowledge graphs.

Explicit conceptual representations – in the form of entity-based knowledge graphs and ontologies – are being leveraged at each step of LLM training, tuning, and deployment. Entity-based knowledge graphs now serve as training data, inform loss functions, guide prompt optimization, filter outputs, and guide LLM construction, evaluation, and use in many ways.  Each time entity-based knowledge graphs are used, the performance of an LLM improves – because the LLM now has access to one kind of machine-accessible semantics: string-concept mappings. 
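One common deployment-time pattern can be sketched as follows: retrieve facts about the entities mentioned in a question from the knowledge graph and prepend them to the prompt, so the LLM's output is grounded in explicit string-to-thing mappings. The graph contents and function names below are illustrative assumptions:

```python
# Sketch of KG-grounded prompting: facts for linked entities are pulled
# from an entity-based knowledge graph and prepended to the prompt.
# The KG contents and the entity ID are illustrative assumptions.

KG = {
    "Q39729": [
        ("Jack Nicholson", "occupation", "actor"),
        ("Jack Nicholson", "award received", "Academy Award"),
    ],
}

def build_grounded_prompt(question: str, entity_ids: list[str]) -> str:
    """Prepend KG facts for the linked entities to the user question."""
    facts = [
        f"- {s} {r} {o}."
        for eid in entity_ids
        for (s, r, o) in KG.get(eid, [])
    ]
    return "Known facts:\n" + "\n".join(facts) + f"\n\nQuestion: {question}"

prompt = build_grounded_prompt("What awards has Jack Nicholson won?", ["Q39729"])
print(prompt)
```

Filtering LLM outputs works analogously in reverse: generated claims are checked against the graph's explicit facts before being shown to users.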

Attributes of real-world entities in the form of multimodal inputs are also being mapped systematically to strings, so multimodal LLMs have moved beyond modeling only string-string relations. These LLMs perform better because they build and have access to another kind of machine-accessible semantics: string-sensation mappings.

Now, because they build and leverage machine-accessible semantics, it seems reasonable to say that these newer systems really do understand, at least partially, as humans do. They may not always understand, but it is clear that these advances already constitute very significant evolutionary progress for knowledge graphs.

