4 Technical Reasons why Data-Driven and Machine Learning NLU is a Myth
SETTING THE STAGE
First, the discussion that follows is concerned with natural language understanding (NLU), namely the task of fully comprehending ordinary spoken language, much like humans do. We are not concerned here with what is inappropriately called natural language processing (NLP) but is really just text processing, where text is treated as mere data (much like bits or pixels) and where data-driven approaches can perform what are essentially pattern-recognition tasks (e.g., filtering, text classification, sequence-to-sequence (so-called) 'translation', search, clustering, etc.) with some degree of 'accuracy'. Therefore, all rebuttals to the arguments made here should relate only to true comprehension of everyday ordinary spoken language - the stuff you do on a daily basis communicating with your kids and the people you engage in conversation with, not the stuff you feel good doing at your desk, building so-called 'models' and processing strings of characters. That stuff might impress some 'AI-hyped' executive, but it has nothing to do with language understanding.
Second, we will present four arguments that clearly show how fallacious the data-driven and machine learning approach to NLU is. These arguments are presented in no particular order, as any one of them is enough on its own to put an end to this futile effort.
Third, the unsound rebuttal along the lines of "but could all of those doing data-driven and machine learning NLU be wrong?" will receive neither sympathy nor attention. Suffice it to say that, yes, "all of those" could be wrong (remember that at some point in time almost all of humanity believed the earth to be flat). As Bertrand Russell once famously said: "That an opinion is widely held is no evidence whatsoever that it is not utterly absurd".
So here we go.
FUNCTION WORDS
The first reason why data-driven/ML approaches to NLU are a myth rests on two claims: (1) function words are crucial in determining the final meaning of a sentence; and (2) data-driven/ML approaches cannot model or account for function words. If (1) and (2) hold, the conclusion follows, so we must of course demonstrate that both are true. First, let us demonstrate by two simple examples why function words are important - in fact, that in the final analysis they are what determines the final meaning. Consider the sentences in (3) and (4):

(3) a. Someone is writing for the White House.
    b. Someone is writing about the White House.

(4) a. There is a house on every street in John's neighborhood.
    b. On every street in John's neighborhood there is a house.
In (3) we have an example of the importance of function words (in this case the prepositions 'for' and 'about') in determining the final meaning of a sentence. Clearly, if we were to ignore the function words (and treat them as stopwords), then (3a) and (3b) would be saying the same thing, even though these sentences refer to a different 'someone', since writing for the White House is clearly quite different from writing about the White House.
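To make this concrete, here is a minimal sketch in Python (the toy stopword list is our own assumption, mimicking the lists shipped with common text-processing toolkits) of how discarding function words collapses (3a) and (3b) into one and the same representation:

```python
# A minimal sketch: a toy stopword list of the kind shipped with
# typical text-processing toolkits (the exact list is assumed).
STOPWORDS = {"for", "about", "the", "a", "is", "every", "on", "in"}

def bag_of_words(sentence):
    # Lowercase, tokenize on whitespace, and drop the function words.
    return {w for w in sentence.lower().split() if w not in STOPWORDS}

s3a = "Someone is writing for the White House"
s3b = "Someone is writing about the White House"

# Both sentences reduce to the same four content words, so the
# pipeline literally cannot tell (3a) and (3b) apart.
print(bag_of_words(s3a) == bag_of_words(s3b))  # True
```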
Regarding (4), we argue that a five-year-old can easily infer that a person uttering (4a) really meant to say (4b); that is, although 'a' out-scopes 'every' in the surface structure of (4a), the real (intended!) meaning is one where 'every' out-scopes 'a', since, in the world we live in, it is highly unlikely that there is some single house that is physically on every street in John's neighborhood. Clearly, then, depending on the final scope of the quantifiers in (4), we could be describing a different reality (or state of affairs). To really appreciate the importance of this issue, let us just say that this is one of the main reasons why the dream of having natural language interfaces to databases (which would require translating an English query into a formal SQL query) has not yet materialized.
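In first-order logic the two readings come apart explicitly; one standard rendering (our own formalization, not from the article) is:

```latex
% Surface reading of (4a): 'a' out-scopes 'every' - one particular house on every street
\exists x\,\bigl(\mathit{house}(x) \land \forall y\,(\mathit{street}(y) \rightarrow \mathit{on}(x,y))\bigr)

% Intended reading (4b): 'every' out-scopes 'a' - each street has some house on it
\forall y\,\bigl(\mathit{street}(y) \rightarrow \exists x\,(\mathit{house}(x) \land \mathit{on}(x,y))\bigr)
```

A natural-language-to-SQL system must commit to one of these two logical forms before it can emit a query, which is exactly where the scope ambiguity bites.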
We believe that we have established point (1) - that function words are important - so let us now look at (2). Data-driven/ML approaches cannot account for function words (e.g., 'the', 'every', 'since', 'must', 'not', 'although', 'for', etc.) because these words disrupt all the probabilities in the model: they carry no significant or meaningful probability, since the co-occurrence probabilities of function words are equal in all contexts. In other words, common function words overwhelm the word distributions, leading to suboptimal results that are difficult to interpret. In fact, this is precisely why such approaches invented the ridiculous term 'stopwords' (the vulgarity even went as far as claiming that these words have no meaning, although these words are what determines the final meaning). Nevertheless, what is important here is that data-driven/ML approaches cannot model and account for these words.
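As a quick illustration, here is a sketch over an assumed three-sentence toy corpus showing how raw co-occurrence counts are dominated by function words that behave identically in every context:

```python
from collections import Counter
from itertools import combinations

# An assumed toy corpus; any English text shows the same pattern.
corpus = [
    "the senator wrote about the scandal",
    "the reporter wrote for the paper",
    "the student wrote about the paper",
]

# Count how often each word co-occurs with every other word in a sentence.
cooc = Counter()
for sentence in corpus:
    words = sorted(set(sentence.split()))
    for w1, w2 in combinations(words, 2):
        cooc[(w1, w2)] += 1

# Pairs involving 'the' top the list in every context, while the choice
# of 'for' vs. 'about' - which changes the meaning - is statistically
# indistinguishable from any other low-frequency pair.
for pair, count in cooc.most_common(6):
    print(pair, count)
```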
In conclusion, function words are important in determining the final interpretation of a sentence, and data-driven approaches cannot account for function words, thus: data-driven/ML models cannot account for meaning. Q.E.D.
Let us now move on to reason number 2.
STATISTICAL INSIGNIFICANCE
Again, our conclusion (which is the same conclusion) is based on two claims. Let us consider the sentences in (5) and (6) to illustrate both of these claims:

(5) a. The trophy did not fit in the suitcase because it was too big.
    b. The trophy did not fit in the suitcase because it was too small.

(6) a. The professor asked the student if he was done writing his thesis.
    b. The professor asked the student if he was done reading his thesis.
It is a well-established fact that antonyms/opposites such as big/small or read/write occur in similar contexts with highly correlated frequencies. Note now that the only difference between (5a) and (5b), as well as between (6a) and (6b), is a pair of words that are known to co-occur in similar contexts with the same frequencies. As such, there is no statistical significance that can reliably be used to infer what 'it' and 'he' refer to in (5) and (6), respectively. In fact, the information needed to resolve the references in these sentences is not in the data but is (mutually) in the heads of the speaker and hearer, namely: if some x does not fit in some y, then small(y) is more likely than small(x), and big(x) is more likely than big(y); and if writing(x, thesis), then x is more likely to be the student than the professor, etc.
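To see just how invisible the difference is to a distributional model, here is a sketch comparing (5a) and (5b) as plain bag-of-words count vectors (cosine similarity is one assumed but typical choice of measure):

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

s5a = "the trophy did not fit in the suitcase because it was too big"
s5b = "the trophy did not fit in the suitcase because it was too small"

va, vb = Counter(s5a.split()), Counter(s5b.split())
print(round(cosine(va, vb), 3))
# ~0.93: nearly identical distributions, yet 'it' refers to the
# trophy in (5a) and to the suitcase in (5b).
```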
One might attempt the futile effort of trying to 'learn' all such patterns, but that would essentially be an attempt at memorizing all possible utterances, which is neither cognitively nor computationally plausible (moreover, [1] has proved that all current models of ML cannot learn functions on infinite domains, and the semantics of natural language is an infinite function, as it maps an infinite number of possible utterances to some meaning). To see why processing more examples in search of a statistically significant pattern won't help, just consider that the most likely referent of 'it' changes if we replace 'did not fit' by the equally probable 'did fit', or 'because' by the equally probable 'although', or 'trophy' by the equally probable 'ball', or 'the suitcase' by the equally probable 'bag', etc. All these replacements - which keep all the probabilities intact - also change the most likely referent of 'it' in (5), and so to learn how to resolve 'it' in (5) a data-driven/ML approach would have to see (memorize?) most (if not all) combinations, which translates into something like 40,000,000 examples (and all of that just to learn how to resolve 'it', and only in structures similar to those in (5)).
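A back-of-the-envelope sketch of the combinatorics (all slot sizes below are our own illustrative assumptions, not figures from any corpus):

```python
# '<object> did/did not fit in <container> because/although it was too <adj>'
objects     = 100  # assumed: interchangeable objects (trophy, ball, laptop, ...)
containers  = 100  # assumed: interchangeable containers (suitcase, bag, box, ...)
polarity    = 2    # 'did fit' vs. 'did not fit'
connectives = 2    # 'because' vs. 'although'
adjectives  = 10   # assumed: antonym pairs (big/small, heavy/light, ...)

print(objects * containers * polarity * connectives * adjectives)  # 400000
# With a modestly larger vocabulary the count reaches tens of millions -
# and all of it only to resolve 'it' in this single sentence frame.
```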
The point here is this: the lack of statistical significance cannot be salvaged by a futile effort to memorize ('see') millions of examples for every simple syntactic structure. In summary: data-driven/ML approaches require statistical significance in the data, and many natural language utterances that have different meanings exhibit no statistically significant difference (the difference in meaning is due to mutually agreed-upon world knowledge and is not in the data); thus, data-driven/ML models cannot account for meaning. Q.E.D.
Now on to reason 3.
INTENSION (with an 'S')
Although √256 and 16 are equal in value, these two objects are the same with respect to one attribute only, namely their arithmetic value. In general, objects like '16', '7 + 9', '√256', '6 + (2 * 5)', etc. have different intensions - aside from their extension/value, they differ in many other attributes (e.g., they involve a different number of operations, a different number of operands, etc.). For the most part, these expressions can be considered the same in mathematical reasoning (e.g., in most applied sciences), but in language this is not the case. This, in fact, is the reason we cannot conclude (8) upon hearing (7) - where (7) is a sentence such as 'John found out that √256 = 16' and (8) is 'John found out that 16 = 16' - although all we did is replace √256 by a value that is equal to it, namely 16.
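This is easy to demonstrate in code; the following sketch (using Python's standard ast module, with √256 written as 256 ** 0.5) shows that expressions with one and the same extension differ in their intensional attributes:

```python
import ast

# Expressions from the text; '256 ** 0.5' stands in for √256.
exprs = ["16", "7 + 9", "256 ** 0.5", "6 + (2 * 5)"]

for src in exprs:
    value = eval(src)  # the extension: every expression evaluates to 16
    tree = ast.parse(src, mode="eval")
    # One intensional attribute: how many arithmetic operations appear.
    n_ops = sum(isinstance(node, ast.BinOp) for node in ast.walk(tree))
    print(f"{src!r}: value = {value}, operations = {n_ops}")

# All four expressions have the same extension (16), but different
# intensions: 0, 1, 1, and 2 operations, different operands, etc.
```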
The above discussion is not specific to arithmetic objects/expressions. Almost all natural language objects have an intension that cannot be equated with expressions that have the same extensional value. A classic example: 'the morning star' and 'the evening star' refer to the same object/entity (the planet Venus), yet freely replacing one by the other in a sentence such as 'the ancients did not know that the morning star is the evening star' results in absurd and contradictory statements.
The notion of intension is a lot more involved than what we have said here. For example, it is the intension of a strike, and not an actual strike, that we refer to in a sentence such as 'the strike that was planned for Sunday was cancelled', because what was actually cancelled is the concept of a strike event - the strike itself never happened. For now it is enough to know that intension refers to concepts and properties of concepts, and not just to actual values/objects. Data-driven/ML approaches are exclusively extensional - i.e., they operate on numerical values (or vectors/tensors of numerical values) - and thus cannot model or account for one of the most important aspects of meaning, namely intensionality. Thus, again, data-driven/ML models cannot account for meaning. Q.E.D.
Finally, let us look at reason number 4.
THE 'MISSING TEXT PHENOMENON' (MTP)
Consider sentences such as those below. The sentences on the left-hand side of '=>' are sentences that we usually say in our everyday discourse, and those on the right-hand side are what is implicitly intended, where the text in brackets is the 'missing text' - text that is not explicitly stated but is implicitly assumed as shared background knowledge:

Mary enjoyed the book => Mary enjoyed [reading] the book
The ham sandwich left without paying => The [person who ordered the] ham sandwich left without paying

Note that this phenomenon (which we call the 'missing text phenomenon' - see [2]) is not an exception but the norm, and it is the source of many challenges in the semantics of natural language.
But if the 'missing text phenomenon' is rampant in natural language, then natural language is highly (probably optimally) compressed, and the challenge in natural language semantics is actually one of decompressing (uncovering) the missing text. Recent research has shown that learnability and compressibility are equivalent (see [3]): to learn a pattern from a data set, the data set must have redundancies and thus be compressible, and vice versa. But if natural language is highly (optimally) compressed, then it is not learnable: using what is essentially a compression technique, we cannot decompress and 'uncover' what is not in the data in the first place. So, again, data-driven/ML models cannot account for meaning. Q.E.D.
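The connection can be illustrated directly with an off-the-shelf compressor (a sketch using Python's zlib; the inputs are assumed stand-ins for 'redundant' versus 'already optimally compressed' data):

```python
import os
import zlib

redundant = b"the cat sat on the mat " * 1000  # highly redundant 'data'
compressed_once = zlib.compress(redundant)     # stand-in for data that is
                                               # already (near-)optimally compressed
random_like = os.urandom(len(compressed_once))

def ratio(data):
    # Compressed size over original size: < 1 means redundancy was found.
    return len(zlib.compress(data)) / len(data)

print(ratio(redundant))        # << 1: plenty of patterns left to 'learn'
print(ratio(compressed_once))  # ~1 or slightly above: no residual redundancy
print(ratio(random_like))      # ~1 or slightly above: same story
```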
FOUR REASONS ARE ENOUGH (FOR NOW)
Although we have given four technical reasons why data-driven/ML approaches to NLU are utterly fallacious, there are in fact several other reasons that are equally fatal. For example, data-driven/ML approaches are completely clueless when it comes to inferring the (prior) presuppositions and the (posterior) entailments that understanding a certain utterance requires. For example, full understanding of 'John quit graduate school and took a position at an AI startup' requires knowing that 'John was in graduate school' is presupposed, and that 'John is now working at a startup' can now be inferred, etc.
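For illustration only, here is a sketch of the kind of symbolic trigger rule such inferences require (the trigger table and the naive argument extraction are entirely hypothetical); the point is that nothing like these rules is present in the surface data itself:

```python
# Hypothetical presupposition/entailment triggers: 'quit X' presupposes
# 'was in/doing X'; 'took a position at X' entails 'is now working at X'.
TRIGGERS = {
    "quit": lambda x: f"was previously in/doing {x}",
    "took a position at": lambda x: f"is now working at {x}",
}

utterance = "John quit graduate school and took a position at an AI startup"

for trigger, rule in TRIGGERS.items():
    if trigger in utterance:
        # Naive argument extraction for the sketch: take what follows
        # the trigger, up to the next conjunction.
        arg = utterance.split(trigger, 1)[1].split(" and ")[0].strip()
        print(f"John {rule(arg)}")
# -> John was previously in/doing graduate school
# -> John is now working at an AI startup
```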
CONCLUDING REMARK
Building systems that understand ordinary spoken language is perhaps an AI-complete problem. It is certainly not 'just' a little more than text processing. In fact, these two tasks have very little (if anything at all) to do with each other. Some sanity is needed in the field before we spend too much time and effort travelling along the wrong path.
REFERENCES
[1] Bringsjord, S. et al. (2018), Do Machine-Learning Machines Learn?, in V. C. Müller (ed.), PT-AI 2017, SAPERE 44, pp. 136-157, Springer.
[2] Saba, W. (2019), On the Winograd Schema: Situating Language Understanding in the Data-Information-Knowledge Continuum, in Proceedings of the 32nd International FLAIRS Conference, Sarasota, FL, May 2019, AAAI Press.
[3] Ben-David, S. et al. (2019), Learnability Can Be Undecidable, Nature Machine Intelligence, Vol. 1, January 2019, pp. 44-48.