Deep Misconceptions and the Myth of Data-Driven Language Understanding
Towards Putting Logical Semantics Back to Work
Early efforts to find theoretically elegant formal models for various linguistic phenomena did not result in any noticeable progress, despite nearly three decades of intensive research (from the late 1950s through the late 1980s). As the various formal (and, in most cases, mere symbol-manipulation) systems seemed to reach a deadlock, disillusionment with the brittle logical approach to language processing grew, and a number of researchers and practitioners in natural language processing (NLP) started to abandon theoretical elegance in favor of attaining quick results using empirical (data-driven) approaches.
All of this seemed natural and expected. In the absence of theoretically elegant models that could explain a range of natural-language phenomena, it was quite reasonable to find researchers shifting their efforts to practical solutions for urgent problems using empirical methods. By the mid 1990s, a data-driven statistical revolution that had already been brewing took the field of NLP by storm, sweeping aside efforts rooted in over 200 years of work in logic, metaphysics, grammar, and formal semantics.
We believe, however, that this trend has overstepped the noble cause of using empirical methods to find reasonably working solutions for practical problems. In fact, the data-driven approach to NLP is now believed by many to be a plausible approach to building systems that can truly understand ordinary spoken language. This is not only a misguided trend but a very damaging development, one that will hinder significant progress in the field.
To this end, we are in the process of writing a monograph that we hope will be a step towards starting a sane and overdue semantic (counter)revolution.
In this monograph:
WE WILL ARGUE THAT many language phenomena are not learnable from data, because (i) in most situations what is to be learned is not even observable in the data (it is not explicitly stated but is implicitly assumed as 'shared knowledge' by a language community); and (ii) in many situations there is no statistical significance in the data, as the relevant probabilities are all equal (for instance, when the two possible referents of a pronoun occur with equal frequency, frequency alone cannot resolve the reference).
WE WILL ARGUE THAT purely data-driven extensional models that ignore intensionality, compositionality, and the inferential capacities of natural language are inappropriate even when the relevant data is available, since higher-level reasoning (the kind needed in NLU) requires intensional reasoning beyond simple data values.
WE WILL ARGUE THAT the most plausible explanation for a number of phenomena in natural language is rooted in logical semantics, ontology, and the computational notions of polymorphism, type unification, and type casting; and we will do this by proposing solutions to a number of challenging and well-known problems in language understanding. (A toy sketch of these type-theoretic notions follows this list.)
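To give a feel for how type unification and type casting might operate, here is a minimal, self-contained sketch in Python. Everything in it, including the toy ontology, the selectional restrictions, and the coercion table, is invented for illustration and is not the formalism of the monograph:

```python
# Toy illustration of type unification and type casting in semantic
# interpretation. The ontology, the selectional restrictions, and the
# coercion table are all invented for this example.

# A tiny ontology: each type points to its supertype.
ONTOLOGY = {
    "human": "physical",
    "artifact": "physical",
    "food": "artifact",
    "movie": "artifact",
    "activity": "abstract",
    "physical": "entity",
    "abstract": "entity",
}

def subsumes(general, specific):
    """True if `specific` is `general` or one of its subtypes."""
    while specific is not None:
        if specific == general:
            return True
        specific = ONTOLOGY.get(specific)
    return False

# Selectional restrictions: the type a verb expects its object to unify with.
EXPECTS = {"enjoy": "activity", "eat": "food"}

# Type casting: how to coerce one type into another via an implicit relation.
COERCIONS = {("movie", "activity"): "watching"}

def interpret(verb, obj_word, obj_type):
    expected = EXPECTS[verb]
    if subsumes(expected, obj_type):             # plain unification succeeds
        return f"{verb}({obj_word})"
    cast = COERCIONS.get((obj_type, expected))   # otherwise, try a type cast
    if cast is not None:
        return f"{verb}({cast}({obj_word}))"
    return "type error"

# "John enjoyed the movie": 'enjoy' expects an activity, 'movie' is an
# artifact, so the object is cast into the activity 'watching(movie)'.
print(interpret("enjoy", "movie", "movie"))   # enjoy(watching(movie))
print(interpret("eat", "pizza", "food"))      # eat(pizza)
```

The point is only the shape of the mechanism: when plain unification fails, a type cast through an implicit relation ('watching') rescues the interpretation, rather than rejecting the sentence as ill-typed.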
Part of the monograph can be downloaded here.
I am well aware that I am repeating myself, but I want to point out again that Ken Church makes this very point in A PENDULUM SWUNG TOO FAR. As one of the scientists who started the empirical revolution, he presents some second thoughts in that paper. "We believe, however, that this trend has overstepped the noble cause of using empirical methods to find reasonably working solutions for practical problems." I believe we may want to mix solutions to some linguistic questions with statistical methods. The journal Linguistic Issues in Language Technology generally addresses this kind of question.
Natural language is like algebra and a programming language: in natural language, knowledge and logic are combined, in the same way that constants and variables are combined with logical symbols (and functions) in algebra and programming languages. In natural language, keywords – mainly nouns and proper nouns – provide the knowledge, while words like the definite article "the", the conjunction "or", the basic verb "is/are", the possessive verb "has/have", and the past-tense verbs "was/were" and "had" provide the logical structure.

However, in knowledge technology this natural logical structure is ignored, because scientists are unaware of the structure that nature provides. Instead of using this natural structure, keywords are linked by an artificially created structure (semantic techniques). Hence the field's struggle to grasp the deeper meaning expressed by humans, and to automatically construct readable sentences from derived knowledge. In other words, this field has a blind spot at the conjunction of logic and language. A science integrates its constituent disciplines; the field of AI and knowledge technology does not. It is unable to integrate (automated) reasoning and natural language:
• Reasoners (like Prolog) are able to reason, but their results – derived knowledge – can't be expressed in readable, automatically constructed sentences;
• Chatbots, virtual (personal) assistants, and Natural Language Generation (NLG) techniques are unable to reason logically. They can only select human-written sentences, into which they may fill in keywords;
• Controlled Natural Language (CNL) reasoners are very limited in integrating both disciplines. They are limited to sentences with the present-tense verb "is/are", and don't accept words like the definite article "the", the conjunction "or", the possessive verb "has/have", or the past-tense verbs "was/were" and "had".

I have a long-term solution. I have knowledge, experience, technology, and results that no one else has. I know what autonomous reasoning is and how to implement it in software. But most parts are in direct conflict with the current scientific opinion on natural intelligence and natural language. Autonomous reasoning requires both natural intelligence and natural language. Without knowing it, scientists applied natural intelligence to natural language at least 200 years ago, by describing reasoning constructions through predicate logic (algebra). Later, these reasoning constructions were implemented in Controlled Natural Language reasoners, which are able to derive new knowledge from previously unknown knowledge, both expressed in readable sentences with limited grammar. For example:
> Given: "John is a father."
> Given: "Every father is a man."
> Logical conclusion: "John is a man."

However, the same conclusion – but in the past-tense form – is not described in any scientific paper, because scientists "forgot" to define algebra for the past tense (see the sketch below):
> Given: "John was a father."
> Given: "Every father is a man."
> Logical conclusion: "John was a man."
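As a rough illustration only (this is not the actual reasoner; the triple representation with an explicit tense slot is invented for this example), the tense-preserving inference can be sketched in a few lines of Python:

```python
# Minimal sketch of tense-preserving reasoning: a tense-neutral universal
# rule ("Every father is a man") applied to a fact that carries tense.
# The representation is invented for this illustration.

# Facts: (subject, tense, category), e.g. "John was a father".
facts = [("John", "was", "father")]

# Universal rules: "Every X is a Y" -> (X, Y). Tense-neutral by design.
rules = [("father", "man")]

def derive(facts, rules):
    """Forward-chain once: each conclusion inherits the tense of its premise."""
    derived = []
    for subject, tense, category in facts:
        for antecedent, consequent in rules:
            if category == antecedent:
                derived.append((subject, tense, consequent))
    return derived

for subject, tense, category in derive(facts, rules):
    print(f"{subject} {tense} a {category}.")   # -> "John was a man."
```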
Another example: the intelligent function of the conjunction "or" in language is not described in any scientific paper. So, there is no technique available to generate the following question through an algorithm:
> Given: "Every person is a man or a woman."
> Given: "Addison is a person."
> Generated question: "Is Addison a man or a woman?"

My algorithm (sketched in code after this comment):
• The conjunction "or" has the logical function (exclusive OR) in language of separating knowledge;
• Given "Every person is a man or a woman" and "Addison is a person";
• Substitution of both sentences: "Addison is a man or a woman";
• Conversion to a question: "Is Addison a man or a woman?".

I am using fundamental science (logic and laws of nature) instead of cognitive science (simulation of behavior):
• Autonomous reasoning requires both intelligence and language;
• Intelligence and language are natural phenomena;
• Natural phenomena obey laws of nature;
• Laws of nature (and logic) are investigated using fundamental science.

Using fundamental science, I gained knowledge and experience that no one else has:
• I have defined intelligence in a natural way: http://mafait.org/intelligence/;
• I have discovered a relationship between natural intelligence and natural language: http://mafait.org/intelligence_in_grammar/, which I am implementing in software;
• And I defy anyone to beat the simplest results of my Controlled Natural Language reasoner in a generic way (from natural language, through algorithms, to natural language): http://mafait.org/challenge/.

It is open-source software: http://mafait.org/download/. So, everyone is invited to join.
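Here is a minimal sketch of the question-generation algorithm described in the comment above, with the sentence patterns hard-coded for the example (a real system would of course need a parser):

```python
# Sketch of the "or" question-generation algorithm: substitute the instance
# into the universal statement, then convert the result into a question.

def generate_question(universal, instance):
    """'Every person is a man or a woman.' + 'Addison is a person.'
    -> 'Is Addison a man or a woman?'"""
    # Parse "Every X is Y" (Y may contain "or", separating the alternatives).
    _, category, _, alternatives = universal.rstrip(".").split(" ", 3)
    # Parse "N is a X".
    name, _, _, instance_category = instance.rstrip(".").split(" ", 3)
    if instance_category != category:
        return None
    # Substitution gives "Addison is a man or a woman";
    # fronting the verb converts it into a question.
    return f"Is {name} {alternatives}?"

print(generate_question("Every person is a man or a woman.",
                        "Addison is a person."))
# -> Is Addison a man or a woman?
```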
Walid, nice to see something this great and in-depth! I'm impressed by your passion for NLP even after so many years.
The physicist and part-time philosopher Boltzmann once said: "In den Werken der Philosophen ist viel Gutes und Richtiges enthalten, vor allem da, wo sie auf andere Philosophen schimpfen. Was sie selbst hinzufügen, hat diese Eigenschaft meist nicht." ("In the works of philosophers you can find a great many good and correct things, in particular where they rail against other philosophers. What they themselves contribute oftentimes does not share these qualities.") I find a similar thing here: many of the criticisms in the monograph fragment are perfectly in order (though many of the criticized shortcomings stem from ignoring concepts going back to Hempel, Frege, and before), but I fail to see the grand picture of where the monograph is driving, other than hitting (correctly and necessarily) flawed approaches. Looking out for the final version. (I hope it will contain fewer typos like "compostionality", "words-sense disambiguation", or "This phenomena".)
Uploaded a new version on Feb. 26, 2017 that includes: a section showing how the proposal of ontological semantics resolves the so-called 'Paradox of the Ravens', and a detailed section on how our proposal deals with lexical (word-sense) disambiguation.