Solving the ‘Same or not the Same’ problem: Probabilistic Entity Linking in Agriculture
Entity linking is a problem that has plagued the data integration domain for years – be it the consolidation of CRMs, ERPs, or any other system of record. At some point you are called to decide whether J. Bloggs is the same as Joe Bloggs… or perhaps Jason Bloggs?
The problem is common enough across industries that even Amazon decided to release a standalone service (AWS Entity Resolution) to help businesses address their woes.
But as always, Agriculture presents certain unique challenges, even when it comes to a problem that has been solved across other industries. These challenges are particularly amplified when it comes to systems utilized directly by Growers, like Farm Management Systems.
In this post we will be examining these challenges, as well as how industry-established solutions can be expanded to solve them.
The Basics of Entity Resolution
Disclaimer – a lot of the illustrations in this article resemble a directed Graph. This is a personal preference: I believe that the Node-and-Edge representation is a great paradigm for storytelling purposes. It does not, however, dictate that any real implementation should be Graph-based (even though it surely could).
In the simplest terms possible, Entity Resolution is about comparing two data entities and deciding whether they are referring to the same physical entity:
Humans are very good at processing this kind of information and coming up with fairly accurate predictions on whether the two Spongebobs mentioned above are referring to the same physical person/sponge-based organism. Machines do so by looking for deterministic (aka exact) or probabilistic (aka fuzzy) matches – a set of predefined heuristics that help the machine calculate the probability of the two entities being the same.
Commercial solutions rely on the fact that data collection across most domains tends to present the same problems – misspelt names, badly formatted Zip Codes, phone numbers without area codes – and provide off-the-shelf components to address those. That’s why a commercial solution would make an easy job of matching the two entities above.
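To make the deterministic-vs-probabilistic distinction concrete, here is a minimal sketch using only Python's standard library. The names and the normalization steps are illustrative; real commercial matchers use far richer heuristics (phonetic encodings, address parsers, and so on):

```python
from difflib import SequenceMatcher

def deterministic_match(a: str, b: str) -> bool:
    """Exact match after basic normalization (case and surrounding whitespace)."""
    return a.strip().lower() == b.strip().lower()

def fuzzy_score(a: str, b: str) -> float:
    """Probabilistic match: a 0..1 similarity ratio between the two strings."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

# Illustrative names -- a hypothetical misspelling scenario
print(deterministic_match("Joe Bloggs", "joe bloggs"))   # exact once normalized
print(fuzzy_score("Joe Bloggs", "J. Blogs"))             # close, but not exact
```

A deterministic rule either fires or it doesn't; the fuzzy ratio gives us the raw material for a probability-style score.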
Once two or more entities have been linked (i.e. deemed to be the same), we usually define the Golden Record – i.e. the best, most complete, and most accurate version of an entity by bringing together all the information about that entity from all those different systems. In our Spongebob example, our Golden Record is a combined entity with the ‘best’ pieces of information from the two source entities:
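As a sketch of how such a merge might work, the toy survivorship rule below simply keeps, per field, the most complete non-empty value across the linked records. The records and the 'longest value wins' rule are assumptions for illustration – real MDM tools apply configurable survivorship rules (most recent, most trusted source, and so on):

```python
def build_golden_record(records):
    """Naive survivorship: for each field, keep the longest (most complete)
    non-empty value found across the linked source records."""
    golden = {}
    for record in records:
        for field, value in record.items():
            if value and len(str(value)) > len(str(golden.get(field, "") or "")):
                golden[field] = value
    return golden

# Hypothetical source entities from two systems
fmis = {"name": "S. Squarepants", "phone": "555-0100", "address": None}
erp = {"name": "Spongebob Squarepants", "phone": None, "address": "124 Conch St"}

print(build_golden_record([fmis, erp]))
```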
So easy-peasy, right? Let’s take all this and apply it to Ag Data… let’s say, to link ERP and FMIS data to get a holistic view of a Grower – from planning, to purchasing, to application.
Setting the baseline Use Case
For storytelling purposes, we shall be utilizing a simple example – introducing Joe and Sam:
Now, let’s say that Joe wants to create a Customer 360 View for Sam. He has two data sources:
He wants to bridge these two sources so that he can compare what is being sold with what is being planned and applied. Are we leaving money on the table? Is there anything we could be doing to increase our sales? Let’s combine the datasets and find out!
Bit of a problem… Missing and/or obfuscated data
Enterprise systems like ERPs tend to be well-governed; there’s a value directly tied to having the data properly populated. Yet, on an FMIS there’s very little value or incentive in providing the same information in a similar fashion. The figure below aims to illustrate this: on the left, the barren FMIS landscape; on the right, the well looked-after ERP data:
This situation – which makes it difficult even for a human to decide on whether these two entities are the same – might appear dramatic, but it’s very realistic, for good reasons:
How could we start addressing this?
To give ourselves a chance of identifying whether these two entities are the same, we have to traverse our ontology and start looking for signals that would tie the two entities:
For example, we know the centroids of the Fields, and we know the Delivery Address of Invoiced products – perhaps we could calculate the distance between those and assign a link probability score (i.e. how likely it is that the Account and Business Partner entities are the same based on that parameter)?
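As an illustration of that first signal, the sketch below computes the great-circle distance between a Field centroid and a Delivery Address, then maps it to a 0..1 score. The 50 km cutoff and the linear decay are assumptions for illustration, not established practice:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def proximity_score(distance_km, cutoff_km=50.0):
    """Map distance to a 0..1 link probability: 1 when co-located,
    decaying linearly to 0 at the (illustrative) cutoff."""
    return max(0.0, 1.0 - distance_km / cutoff_km)

# Hypothetical field centroid vs. delivery address coordinates
d = haversine_km(41.878, -93.097, 41.950, -93.200)
print(proximity_score(d))
```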
How about comparing products that were applied to those that were purchased in the appropriate time frame? What’s the Link Probability Score of that?
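That second signal could be sketched as the fraction of applied products that were also purchased within a preceding time window. The 180-day window, product names, and dates below are illustrative assumptions:

```python
from datetime import date, timedelta

def product_overlap_score(applied, purchased, window_days=180):
    """Fraction of applied products that also appear as purchases within
    the preceding time window (an illustrative heuristic)."""
    matched = 0
    for product, applied_on in applied:
        for p, purchased_on in purchased:
            if p == product and timedelta(0) <= applied_on - purchased_on <= timedelta(days=window_days):
                matched += 1
                break
    return matched / len(applied) if applied else 0.0

# Hypothetical FMIS applications vs. ERP invoice lines
applied = [("Herbicide X", date(2024, 5, 10)), ("Seed Y", date(2024, 4, 2))]
purchased = [("Herbicide X", date(2024, 3, 1)), ("Fertilizer Z", date(2024, 2, 15))]
print(product_overlap_score(applied, purchased))  # one of two applications matched
```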
Bit by bit, we could start adding (thin and potentially weak) relationships between the two entities:
In a court of law, each one of these Edges (i.e. lines between the nodes) would be considered Circumstantial Evidence – they don’t prove directly that the two entities are the same. However, a multitude of Circumstantial Evidence can eventually tell us the right story.
At the same time, not every Edge is equally important. Being able to match a government-issued identifier will always be more important than a Field centroid being close to the place where Seed was delivered.
Ultimately, any attempt to calculate the probability of the two entities being the same would involve a set of logical checks, each with its own weight and calculation logic. For example:
The logic for each rule would have to be implemented, allowing us to calculate a sub-score for each. Ultimately, we would like to apply Confidence Aggregation – i.e. adding up the strength of all these pieces of evidence. When enough clues line up, we become confident that the two records are the same. In essence, confidence aggregation formalizes the detective’s intuition of “the whole is greater than the sum of its parts” into a precise, quantitative decision-making tool for entity linking.
The simplest implementation would be Weighted Sums, which will be used here for illustrative purposes. It is far from perfect, but it is simple enough to showcase the logic of such operations. Multiplicative approaches can be used when we want a single strong match (like a government ID) to carry more weight than several smaller matches combined.
Now, let’s get back to our Sam Farms LLC example and see how it would play out (we assume that the Fields are close to the declared addresses, and some relevant product was purchased):
The Overall Confidence would simply be the Sum of the Contributions (0.92) divided by the Sum of the Weightings (4.4), leaving us with a “respectable” 0.21 – meaning that there’s roughly 21% confidence that the two records refer to the same entity.
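The calculation can be reproduced in a few lines of code. Note that the individual rules, weights, and scores below are hypothetical – only the totals (a 0.92 contribution over a 4.4 total weight) match the example:

```python
def weighted_confidence(rules):
    """Weighted-sum confidence aggregation: sum of (weight * score)
    divided by the sum of all weights."""
    total_weight = sum(weight for weight, _ in rules)
    contribution = sum(weight * score for weight, score in rules)
    return contribution / total_weight

# Hypothetical rule outcomes -- only the totals (0.92 / 4.4) match the example
rules = [
    (2.0, 0.0),  # government ID: missing on the FMIS side, so no contribution
    (1.0, 0.2),  # name similarity: weak fuzzy match
    (0.8, 0.6),  # address proximity: fields are close to the delivery address
    (0.6, 0.4),  # product overlap: some purchased products were applied
]
print(round(weighted_confidence(rules), 2))
```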
So, there you have it. We have linked the two records with a certain degree of confidence, albeit low…
Transitive Entity Linking
Introducing these ‘confidence edges’ between our nodes opens a few interesting capabilities – one of them being the ability to link entities through inferred relationships.
Let’s assume we are attempting to link entities from 3 different systems – an FMIS, an ERP and a CRM. We have defined two heuristic rulesets: one to link FMIS to ERP and one to link ERP to CRM:
The principle of Transitive Entity Linking is that when given the above, there’s surely a link between the FMIS and CRM entities. A link with its own Confidence Score.
Calculating that Score can be done in multiple ways – with the most straightforward one being a Simple Transitive Formula:
0.21 x 0.9 = 0.189 (18.9%)
This is intuitive but can under/overestimate scores, especially when the two links rely on different rulesets (like in the case above). More sophisticated probabilistic methods like Bayesian Inference would be more appropriate, but I will not embarrass myself by trying to walk through them. What is important is that we have acquired an inferred link without having to define and apply a specific ruleset:
In practice, adopting Transitive Linking in your solution would help you unearth potential relationships, without the upfront effort to define and maintain multiple rulesets.
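The Simple Transitive Formula is trivially expressed in code – chaining the FMIS→ERP confidence (0.21) with the ERP→CRM confidence (0.9) from the example:

```python
def transitive_confidence(ab, bc):
    """Simple transitive formula: the inferred A->C confidence is the
    product of the A->B and B->C confidences."""
    return ab * bc

# FMIS->ERP (0.21) chained with ERP->CRM (0.90), as in the example
print(round(transitive_confidence(0.21, 0.9), 3))  # 0.189
```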
Cluster-based Entity Linking
That’s not to say there isn’t value in defining that third ruleset. Let’s consider the scenario where linking has been performed across all three entities using distinct rules:
Looking at this as a human observer, the question is obvious:
If the CRM entity is almost definitely the same as the ERP entity (90% confidence)
And the CRM entity is most likely the same as the FMIS Entity (70% confidence)
Then surely, we are more than 21% sure that the FMIS entity is the same as the ERP one
That is because in this representation we are looking at the entities as a cluster, as opposed to individual pairs. The good news is that there are techniques that can be applied to help us adjust our confidence in such scenarios. For example:
No matter the approach, the reality remains that as we build linking relationships, there are techniques that allow us to revise our understanding of – and confidence in – these relationships.
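One illustrative way to perform such a revision is to treat the direct FMIS–ERP edge and the indirect path through the CRM as independent pieces of evidence and combine them with a noisy-OR. The independence assumption is a simplification – real systems might use Bayesian updating or graph clustering instead:

```python
def noisy_or(*probabilities):
    """Combine independent evidence: the probability that at least one signal holds."""
    result = 1.0
    for p in probabilities:
        result *= 1.0 - p
    return 1.0 - result

direct = 0.21        # FMIS <-> ERP from the pairwise ruleset
via_crm = 0.9 * 0.7  # ERP <-> CRM chained with CRM <-> FMIS
print(round(noisy_or(direct, via_crm), 2))
```

The revised confidence (roughly 0.71) reflects the human intuition above: with a strong CRM–ERP link and a decent CRM–FMIS link, we are indeed far more than 21% sure.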
Golden Records – or Dynamic Golden Views?
We’ve already introduced the concept of a Golden Record as “a combined entity with the ‘best’ pieces of information from the various linked entities”. This is a fundamental concept within the Master Data Management domain, with terms like Canonical Version or Single Source of Truth also being used to describe the concept.
However, there is a fundamental problem with this notion: for what purpose is the record considered Golden? If our use case is legally enforced compliance, the “Golden Entry Requirements” would be a lot higher – but if we are talking about Marketing & Lead Generation, then perhaps the entry requirements are significantly lower.
In a nutshell, the traditional Golden Record concept lacks context-awareness. This is particularly problematic in domains where entity linking is difficult due to incomplete and obfuscated data – like in Agriculture.
Instead, and building on the concepts described up to this point, we can start talking about Dynamic Golden Views. They embrace contextual truth, where the definition of a valid link depends on business intent, risk tolerance and evidence quality.
The easiest way to visualize this is as follows:
Effectively, there can be multiple ‘link edges’ between two nodes. The rulesets that are applied for the definition of each edge can vary, depending on the Use Case we are trying to address. This approach ensures that decisions are made based on the best available evidence, aligned with the intended purpose – whether that’s compliance, product research, marketing insight, or anything else.
It also means that we are able to deliver results earlier by prioritizing use-cases that allow for a lower confidence. This can mean we can see the benefits of such a system in our sales function, despite the fact that we have not yet finished all the components needed for compliance linking.
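A minimal sketch of such context-dependent linking: each use case carries its own confidence bar, and the same set of link edges yields a different ‘view’ depending on which bar is applied. The thresholds, entity identifiers, and edge confidences below are all assumptions:

```python
# Illustrative use-case thresholds -- names and numbers are assumptions
THRESHOLDS = {"compliance": 0.95, "product_research": 0.75, "marketing": 0.50}

def usable_links(edges, use_case):
    """Return only the link edges whose confidence clears the bar
    for the given business use case."""
    bar = THRESHOLDS[use_case]
    return [(src, dst) for src, dst, confidence in edges if confidence >= bar]

edges = [("FMIS:sam", "ERP:sam_farms", 0.71), ("CRM:sam", "ERP:sam_farms", 0.90)]
print(usable_links(edges, "marketing"))    # both links clear the marketing bar
print(usable_links(edges, "compliance"))   # neither clears the compliance bar
```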
And now, for the practical bit…
At the beginning of this article, I added a disclaimer that the Graph-like illustrations (i.e. the use of nodes and edges) are a matter of preference and do not imply that Graph technologies are required for implementation. That was only partially true: the very nature of what we are trying to achieve lends itself perfectly to graph technologies.
Even though the commercial MDM market can be considered mature, it’s only with relatively recent developments around Graph technologies that we are seeing new entrants with claims that could potentially cater to our stated needs (dynamic golden views, graph-based linking analysis etc.).
Solutions like Reltio pop up frequently during desk research, and whilst I cannot vouch for them personally, they certainly appear to be aligned with the thinking described here.
Nevertheless, when choosing a way forward – be it vendor selection or implementation from scratch – there are certain aspects beyond entity linking that must be considered. Namely:
It would be naïve to think that any of this is trivial – however, I am personally convinced that this approach is the right one for unlocking the full potential of siloed Agriculture data.