Designing a Q&A Agent

I am writing this because as I transition from my lab position to my new job, I need to purge my brain :) I have found that writing is the best way to get ideas out of my head. Otherwise, I cannot stop thinking about them.

I have been thinking a lot about creating a next-generation Q&A agent with the following constraints:

  1. Responses must be controlled. The agent should never reply with something that has not been reviewed beforehand by a human. If using template-based NLG, the keywords must come from a reviewed source.
  2. The agent should be able to answer questions written at the reading level of a 15-year-old in the US, including multi-sentence questions. This is required to run the agent in an environment where users may not know they are asking an agent a question: for example, a thread where humans normally answer questions during the day and the agent is turned on for nights and weekends.
  3. The goal is for the agent to answer questions alongside humans. This requires an agent that is capable of not responding to a question if it is not confident in the answer. Not answering is better than answering incorrectly. My personal belief is that Q&A agents should be like calculators: a calculator does not require a personality, and no one expects a calculator to imitate a human. The purpose of the dialogue manager is to guide the user toward wording the question in a manner the agent can understand.
  4. The agent answers questions about a closed domain (commonly called a closed-domain agent). The domain is defined by a corpus from which the agent derives all of its responses.
  5. Finally, the agent has to be easily adaptable to answering questions about different domains. In the ideal architecture, the agent extracts answers from the corpus directly, requiring no domain-specific training.

The goal of a question answering (Q&A) agent is to provide information to a user. The user enters a question in the form of text. The agent extracts the intent and entity from the question and uses that information to select the best response. Normally the entity is the object of the question, but that is not always the case. Of course, the agent must have data on the entity or it cannot answer the question. Because the set of valid entities is therefore bounded by the corpus, the agent only has to match the question against a known entity list, which significantly simplifies extracting the entity. The difficult piece of information to extract from the question is the intent. As shown in the table below, there is more depth and ambiguity in extracting intent.

[Image: table contrasting entity extraction with the depth and ambiguity of intent extraction]
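To make the point about the corpus constraining the entity concrete, here is a minimal sketch of entity extraction against a corpus-derived vocabulary. The entity list and question are hypothetical; a real agent would build the list with an NER algorithm or noun-chunk extraction.

```python
# Minimal sketch: a closed-domain agent only knows entities that appear in
# its corpus, so entity extraction can be reduced to matching the question
# against a corpus-derived vocabulary. The entity list is hypothetical.
corpus_entities = {"watchdog timer", "bootloader", "flash memory"}

def extract_entity(question):
    q = question.lower()
    # Try the longest entities first so "watchdog timer" beats "timer".
    for entity in sorted(corpus_entities, key=len, reverse=True):
        if entity in q:
            return entity
    return None  # no known entity: better not to answer (requirement 3)

print(extract_entity("How do I configure the watchdog timer?"))
# -> watchdog timer
```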

One of the simplest methods of extracting intent is question similarity (this is how IBM Watson and Amazon Lex work). In the statistical approach, a machine learning classifier (part of the Watson or Lex service) is trained before the agent is released: supervised learning over manually labeled questions produces a model. When the user asks a question, the model classifies it to the label whose training questions are most similar. Question similarity can also be done without machine learning: instead of a statistical approach, a search-based approach looks up the closest matching question in the training question database at runtime.
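As a rough illustration of the statistical path (not Watson or Lex internals, which are not public), here is a minimal supervised classifier over labeled training questions using scikit-learn; the questions and labels are made up.

```python
# Minimal sketch of a statistical question-similarity classifier:
# supervised learning over manually labeled training questions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

training_questions = [
    ("How do I reset the board?",         "reset_procedure"),
    ("What resets the device?",           "reset_procedure"),
    ("How much flash does the MCU have?", "memory_size"),
    ("What is the flash size?",           "memory_size"),
]
texts, labels = zip(*training_questions)

# Word and bigram TF-IDF features feeding a logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["how can the board be reset?"])[0])  # -> reset_procedure
```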

[Image: two sentence-similarity Q&A architectures: statistical model and embedding search]

The image above shows two sentence-similarity Q&A architectures: a statistical model and an embedding search. I have experimented with both, and both suffer from the same problem: the training questions have to be manually labeled. Originally we tested word2vec with an averaging scheme; later we used SBERT sentence embeddings. The performance difference between the two techniques (statistical versus search) was not drastic enough to warrant a pivot in the architecture. I make no strong claim here because the SBERT experiment was severely time-constrained, so it may be possible to improve its performance, and I never had a chance to try the Google Universal Sentence Encoder. I have also not found any documentation claiming that any of these techniques work well with multi-sentence questions. In addition, these methods all require labeled training questions (supervised learning). To decrease the amount of human work required to create an agent, we need to find a method that does not require annotated training questions (requirements 4 and 5) or figure out how to generate the training questions automatically.
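For reference, here is a minimal sketch of the search path using SBERT embeddings via the sentence-transformers library; the model name and training questions are illustrative. The user question is embedded at runtime and matched against the pre-embedded training questions by cosine similarity, so adding questions requires no retraining.

```python
# Minimal sketch of embedding search with SBERT (sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

training_questions = ["How do I reset the board?", "What is the flash size?"]
labels = ["reset_procedure", "memory_size"]
corpus_emb = model.encode(training_questions, convert_to_tensor=True)

# Embed the user question and find the most similar training question.
query_emb = model.encode("how can the board be reset?", convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]
best = scores.argmax().item()
print(labels[best], float(scores[best]))  # label plus a similarity score
```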

One method of automatically generating training questions is to use templates and entities. Using either an NER algorithm or a dependency parser/noun chunk extractor (e.g., spaCy), extract the keywords from the corpus (TF-IDF may seem like an obvious choice, but in practice it did not work very well). Manually create a list of template training questions, then generate a training set by filling in the templates with the keywords, as sketched below. This method can be made to work, but there are two critical variables: the entities found by the entity extractor and the distribution of your training question templates.
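A minimal sketch of the template approach, assuming spaCy noun chunks as the keyword extractor; the corpus sentence and templates are made up.

```python
# Minimal sketch: extract keywords from the corpus with spaCy noun chunks,
# then fill hand-written question templates with them.
import spacy

nlp = spacy.load("en_core_web_sm")
corpus = "The watchdog timer resets the processor when the firmware hangs."

keywords = set()
for chunk in nlp(corpus).noun_chunks:
    # Drop leading determiners ("the watchdog timer" -> "watchdog timer").
    words = [t.text.lower() for t in chunk if t.pos_ != "DET"]
    if words:
        keywords.add(" ".join(words))

templates = ["What is {kw}?", "How does {kw} work?", "Tell me about {kw}."]

training_questions = [t.format(kw=kw) for kw in sorted(keywords) for t in templates]
for q in training_questions:
    print(q)
```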


Whatever algorithm Watson uses, it is very sensitive to similar training questions carrying different labels (this makes sense). It is very important that the training questions be unique between labels. When manually creating the training questions, this is actually easy to do. When automatically generating training questions, there should be an additional algorithm that scans the training set to make sure questions are unique between labels, including the removal of duplicate questions (Amazon Lex rejects duplicate questions at build time; Watson does not). A sketch of such a check follows.
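Here the idea is simply to drop any generated question that appears under more than one label, plus exact duplicates within a label. The data is hypothetical.

```python
# Sketch of a uniqueness scan over generated (question, label) pairs.
from collections import defaultdict

def dedupe_across_labels(questions):
    """questions: list of (text, label) pairs; returns the filtered list."""
    labels_per_text = defaultdict(set)
    for text, label in questions:
        labels_per_text[text.lower().strip()].add(label)

    seen, kept = set(), []
    for text, label in questions:
        key = text.lower().strip()
        if len(labels_per_text[key]) > 1:
            continue  # same question under two labels: ambiguous, drop both
        if key in seen:
            continue  # exact duplicate within a label, drop
        seen.add(key)
        kept.append((text, label))
    return kept

pairs = [
    ("What is the watchdog?", "watchdog"),
    ("What is the watchdog?", "reset"),    # ambiguous across labels
    ("How big is flash?", "memory"),
    ("How big is flash?", "memory"),       # duplicate within a label
]
print(dedupe_across_labels(pairs))  # -> [('How big is flash?', 'memory')]
```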


My first attempt at eliminating the training questions was a rule-based approach using dependency trees. spaCy has a rule-matching module that supports rules based on various token attributes, including dependency relations. This experiment worked very well in terms of classifying questions to labels. Unfortunately, creating the rule set was a time-consuming manual task.
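For a sense of what such a rule looks like, here is a sketch using spaCy's DependencyMatcher. The rule flags "What is X?" style definition questions; parses vary by model and phrasing, which is part of what made hand-writing a full rule set so time-consuming.

```python
# Sketch of one dependency-tree rule with spaCy's DependencyMatcher.
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    # Anchor on a root copula ("is").
    {"RIGHT_ID": "root", "RIGHT_ATTRS": {"LEMMA": "be", "DEP": "ROOT"}},
    # A wh-pronoun ("What") attached to the root.
    {"LEFT_ID": "root", "REL_OP": ">", "RIGHT_ID": "wh",
     "RIGHT_ATTRS": {"TAG": "WP"}},
    # The noun being asked about, attached as subject or attribute.
    {"LEFT_ID": "root", "REL_OP": ">", "RIGHT_ID": "topic",
     "RIGHT_ATTRS": {"DEP": {"IN": ["nsubj", "attr"]}, "POS": "NOUN"}},
]
matcher.add("DEFINITION_QUESTION", [pattern])

doc = nlp("What is a watchdog timer?")
for _, token_ids in matcher(doc):
    print("definition question about:", doc[token_ids[2]].text)  # -> timer
```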

Reading comprehension is an exciting approach to question answering. It is an NLP task in which a question is answered by returning a span of the corpus: the algorithm is given a question and responds with a portion of the corpus (called a span). Relative to the corpus, reading comprehension is an unsupervised or semi-supervised process. Reading comprehension can be accomplished using many techniques, but transformer-based and knowledge graph-based solutions are of the most interest to me.
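A minimal sketch of extractive reading comprehension with a SQuAD-tuned transformer from Hugging Face; the context and question are illustrative.

```python
# Minimal sketch: a transformer reader returns a span of the given context
# plus a confidence score.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = ("The watchdog timer resets the processor if the firmware "
           "fails to service it within the configured timeout.")
result = qa(question="What resets the processor?", context=context)
print(result["answer"], result["score"])  # span text plus confidence
```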

The above links point to various reading comprehension algorithms and their performance on standard test datasets. The transformer approaches tend to be the top performers, beating even humans in the case of SQuAD. The problem with using any of these algorithms as-is is that they are designed to find an answer in relatively short text (a few paragraphs). Either an additional process must be added to break the corpus up into smaller parts, or the algorithms need to be changed to support larger contexts. In addition, these algorithms expect single-sentence (with some support for two-sentence) questions; to satisfy requirement 2, multiple sentences must be supported. NewsQA is a dataset that tests for larger contexts and questions.
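One workaround sketch for the short-context limit: split the corpus into overlapping chunks, run the reader over each, keep the best-scoring span, and abstain below a confidence threshold (requirement 3). The chunk size, overlap, and threshold values are arbitrary placeholders.

```python
# Sketch: run a short-context reader over a long corpus in chunks.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

def chunks(text, size=1000, overlap=200):
    # Overlapping character windows so answers straddling a boundary survive.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def answer(question, corpus, threshold=0.5):
    best = max((qa(question=question, context=c) for c in chunks(corpus)),
               key=lambda r: r["score"])
    # Abstain when confidence is low: not answering beats answering wrong.
    return best["answer"] if best["score"] >= threshold else None
```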


The most exciting approach is combining NLP with knowledge graphs. A knowledge graph is generated automatically from the corpus, and the question is answered by querying the knowledge graph. Auto-generating knowledge graphs from text and querying knowledge graphs using natural language are both active fields in NLP.

[Image: diagram of the rule-based corpus-to-knowledge-graph experiment]

I am currently experimenting with the "rules" shown in the above diagram. The goal is to encode a corpus into a knowledge graph and use the knowledge graph to find the answer to the question in the corpus. It is currently just an experiment.
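For a sense of the experiment's shape, here is a toy sketch (not my actual implementation): extract (subject, verb, object) triples from the corpus with spaCy's dependency parse, store them in a networkx graph, and answer a question by looking up the matching edge. Real triple extraction and natural language graph querying are far harder than this.

```python
# Toy sketch: corpus -> (subject, verb, object) triples -> graph lookup.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    triples = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ == "VERB":
                subjects = [c for c in tok.children if c.dep_ == "nsubj"]
                objects = [c for c in tok.children if c.dep_ == "dobj"]
                for s in subjects:
                    for o in objects:
                        triples.append((s.lemma_, tok.lemma_, o.lemma_))
    return triples

graph = nx.DiGraph()
for subj, verb, obj in extract_triples("The watchdog timer resets the processor."):
    graph.add_edge(subj, obj, relation=verb)

# Query: what does the (watchdog) timer reset?
for _, obj, data in graph.out_edges("timer", data=True):
    if data["relation"] == "reset":
        print(obj)  # -> processor
```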


Purge complete!


Below are some of the papers I reviewed for this article. I tried to highlight some interesting content from the papers. In most cases, I included a link to the PDF, but links to certain sites are not allowed.

Question type is defined as a certain semantic category of questions characterised by some common properties. The major question types are: factoid, list, definition, hypothetical, causal, relationship, procedural, and confirmation questions.

A factoid question is a question which usually starts with a Wh-interrogative word (What, When, Where, Who) and requires as an answer a fact expressed in the text body.

A list question is a question, which requires as an answer a list of entities or facts. A list question usually starts as: List/Name [me] [all/at least NUMBER/some].

A definition question is a question, which requires finding the definition of the term in the question and normally starts with What is. Related to the latter is the descriptive question, which asks for definitional information or for the description of an event, and the opinion question whose focus is the opinion about an entity or an event.

A hypothetical question is a question, which requires information about a hypothetical event and has the form of What would happen if. A causal question is a question, which requires explanation of an event or artifact, like Why.

A relationship question asks about a relation between two entities.

A procedural question is a question, which requires as an answer a list of instructions for accomplishing the task mentioned in the question. A confirmation question is a question, which requires a Yes or No as an answer to an event expressed in the question.
A typical pipeline question answering system consists of three distinct phases: question classification, information retrieval or document processing, and answer extraction.

Question classification is the first phase: it classifies user questions, derives expected answer types, extracts keywords, and reformulates a question into multiple semantically equivalent questions. Reformulating a query into similar-meaning queries is also known as query expansion, and it boosts the recall of the information retrieval system.
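As a small illustration of the query expansion idea in this excerpt, here is a sketch using WordNet synonyms via NLTK; real systems use far more sophisticated reformulation.

```python
# Sketch: expand a query term with WordNet synonyms to boost IR recall.
# Requires: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def expand(term):
    synonyms = {term}
    for synset in wn.synsets(term):
        synonyms.update(l.name().replace("_", " ") for l in synset.lemmas())
    return synonyms

print(expand("reset"))  # the original term plus its WordNet synonyms
```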

Information retrieval (IR) system recall is very important for question answering. If no correct answers are present in a document, no further processing can be carried out to find an answer. Precision and ranking of candidate passages can also affect question answering performance in the IR phase. Answer extraction is the final component in a question answering system [5].


Rule-Based Question Answering Systems

The rule-based QA system is an extended form of the IR-based QA system. Rule-based QA does not use deep language understanding or specific sophisticated approaches; a broad coverage of NLP techniques is used in order to achieve accuracy of the retrieved answers. Popular rule-based QA systems such as Quarc and Noisy Channel generate heuristic rules with the help of lexical and semantic features in the questions. For each type of question, rules are generated for semantic classes such as who, when, what, where, and why. "Who" rules look for names, mostly nouns for persons or things. "What" rules focus on a generic word-matching function shared by all question types, consisting of DATE expressions or nouns. "When" rules mainly consist of time expressions. "Where" rules mostly consist of matching locations, such as "in", "at", "near", and "inside". "Why" rules are based upon observations that nearly match the question. These rule-based QA systems first establish parse notations and generate training and test cases through the semantic model. The system consists of common modules such as an IR module and an answer identifier or ranker module.


Question processing is divided into two main procedures.

The first one is to analyze the structure of the user’s question.

The second one is to transform the question into a meaningful question formula compatible with the QA system's domain (Hamed and Ab Aziz, 2016).

Questions can also be defined by the type of answer expected. The types are factoid, list, definition and complex question (Kolomiyets and Moens, 2011).

Factoid questions are the ones that ask about a simple fact and can be answered in a few words (Heie et al., 2012), for instance, How far is it from Earth to Mars?

A list question demands as an answer a set of entities that satisfy given criteria (Heie et al., 2012); When did Brazil win Soccer World Cups? illustrates this point clearly.

Definition questions expect a summary or a short passage in return (Neves and Leser, 2015): How does the mitosis of a cell work? is a good illustration of it.

In contrast, a complex question is about information in a context. Usually, the answer is a merge of retrieved passages. This merge is implemented using algorithms such as Normalized Raw-Scoring, Logistic Regression, Round-Robin, Raw Scoring, and 2-step RSV (García-Cumbreras et al., 2012).


Information Retrieval QA: Usage of search engines to retrieve answers, then applying filters and ranking to the recovered passage.


Natural Language Processing QA: Usage of linguistic intuitions and machine learning methods to extract answers from retrieved snippets.


Knowledge Base QA: Find answers from a structured data source (a knowledge base) instead of unstructured text. Standard database queries are used in place of word-based searches (Yang et al., 2015). This paradigm makes use of structured data, such as ontologies. An ontology describes a conceptual representation of concepts and their relationships within a specific domain. An ontology can be considered a knowledge base with a more sophisticated form than a relational database (Abdi et al., 2016). To execute queries that retrieve knowledge from the ontology, structured languages have been proposed; one of them is SPARQL.

Hybrid QA: High-performance QA systems make use of as many types of resources as possible, especially with the prevailing popularity of modern search engines and the enriching community-contributed knowledge on the web. A hybrid approach is the combination of IR QA, NLP QA, and KB QA. The main example of this paradigm is IBM Watson (Ferrucci et al., 2013).

Reading Comprehension (RC), or the ability to read text and then answer questions about it, is a challenging task for machines, requiring both understanding of natural language and knowledge about the world. Consider the question “what causes precipitation to fall?” posed on the passage in Figure 1. In order to answer the question, one might first locate the relevant part of the passage “precipitation ... falls under gravity”, then reason that “under” refers to a cause (not location), and thus determine the correct answer: “gravity”

Pundge, Ajitkumar M., S. A. Khillare, and C. Namrata Mahender. "Question answering system, approaches and techniques: a review." International Journal of Computer Applications 141.3 (2016): 0975-8887.

This system would not accept the link to a PDF of the above paper.

4.1 Linguistic Approach: The linguistic approach uses understanding of natural language text along with linguistic and common knowledge. Linguistic techniques such as tokenization, POS tagging, and parsing are applied to the user's question to formulate it into a precise query that extracts the respective response from a structured database [10].
4.2 Statistical Approach: The availability of huge amounts of data on the internet has increased the importance of statistical approaches. Statistical learning methods give better results than other approaches. Online text repositories and statistical approaches are independent of structured query languages and can formulate queries in natural language form. Most statistical QA systems apply statistical techniques such as support vector machine classifiers, Bayesian classifiers, and maximum entropy models [20][21].
4.3 Pattern Matching Approach: The pattern matching approach deals with the expressive power of text patterns, replacing the sophisticated processing involved in other computing approaches. Most pattern matching QA systems use surface text patterns, while some also rely on templates for response generation [20][21].
The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. The pre-trained Universal Sentence Encoder is publicly available in TensorFlow Hub. It comes in two variations: one trained with a Transformer encoder and the other trained with a Deep Averaging Network (DAN). The two trade off accuracy against computational resource requirements. While the one with the Transformer encoder has higher accuracy, it is computationally more intensive; the one with the DAN encoder is computationally less expensive, with slightly lower accuracy.
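For reference, loading the publicly available Universal Sentence Encoder from TensorFlow Hub looks roughly like this; the two questions are illustrative.

```python
# Sketch: embed two questions with the Universal Sentence Encoder and
# compare them with cosine similarity.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = embed(["How do I reset the board?",
                 "How can the board be reset?"]).numpy()

a, b = vectors
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```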
In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.
A typical machine reading comprehension task can be formulated as a supervised learning problem. Given a collection of textual training examples {(p_i, q_i, a_i)}, i = 1, ..., n, where p is a passage of text and q is a question regarding that passage, the goal is to learn a predictor f that takes a passage p and a corresponding question q as inputs and gives the answer a as output: a = f(p, q). It is necessary that a majority of native speakers would agree that the question q does regard the text p, and that the answer a is a correct one which does not contain information irrelevant to the question [14].
The relationship between question answering and machine reading comprehension is very close. Some researchers consider MRC as a kind of specific QA task [14, 56], and compared with other QA tasks such as open-domain QA, it is characterized by that the computer is required to answer questions according to the specified text. However, other researchers regard the machine reading comprehension as a kind of method to solve QA tasks. For example, in order to answer open-domain questions, Chen et al. [15] first adopted document retrieval to find the relevant articles from Wikipedia, then used MRC to identify the answer spans from those articles. Similarly, Hu [39] regarded machine reading as one of the four methods to solve QA tasks. The other three methods are rule-based method, information retrieval method and knowledge-based method.
Span prediction: In a span prediction task, the answer is a span of text in the context. That is, the MRC system needs to select the correct beginning and end of the answer text from the context.
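Mechanically, span prediction looks roughly like this: the model emits start and end logits over the context tokens, and the answer is the decoded token range between the argmaxes. The checkpoint is a public SQuAD-tuned model; the question and context are illustrative.

```python
# Sketch of span prediction: pick the argmax start and end token positions.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

name = "distilbert-base-cased-distilled-squad"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

inputs = tok("What causes precipitation to fall?",
             "Precipitation falls under gravity.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

start = out.start_logits.argmax()
end = out.end_logits.argmax() + 1
print(tok.decode(inputs["input_ids"][0][start:end]))  # -> gravity
```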


Named Entity Recognition (NER) is a key component in NLP systems for question answering, information retrieval, relation extraction, etc. NER systems have been studied and developed widely for decades, but accurate systems using deep neural networks (NN) have only been introduced in the last few years. We present a comprehensive survey of deep neural network architectures for NER, and contrast them with previous approaches to NER based on feature engineering and other supervised or semi-supervised learning algorithms. Our results highlight the improvements achieved by neural networks, and show how incorporating some of the lessons learned from past work on feature-based NER systems can yield further improvements.
Even though in some studies MRC is referred to as question answering (QA), they differ in the following ways:

  1. The main objective of QA systems is to answer the input questions, while in an MRC system, as its name indicates, the main goal is to understand natural language.
  2. The only input to QA systems is the question, while the inputs to MRC systems are the question and the corresponding context that should be used to answer it. For this reason, MRC is sometimes referred to as QA from text [4-6].
  3. The information sources used to answer questions in MRC systems are natural language texts, while QA systems can use structured and semi-structured data sources such as knowledge bases in addition to unstructured text.
The approaches used for developing MRC systems can be grouped into three categories: rule-based methods, classical machine learning-based methods, and deep learning-based methods. Traditional rule-based methods use rules handcrafted by linguistic experts. These methods suffer from the problem of the incompleteness of the rules. This approach is also domain-specific: for any new domain, a new set of rules must be handcrafted ....
The second approach is based on classical machine learning. These methods rely on a set of human-defined features and train a model for mapping input features to the output. Note that in classical machine learning-based methods, even though hand-crafted rules are not necessary, feature engineering is a critical necessity.
The third approach uses deep learning methods to learn features from raw input data automatically. These methods require a large amount of training data to create high-accuracy models. Because of the growth of available data and computational power in recent years, deep learning methods have achieved state-of-the-art results in many tasks. In the MRC task, most recent research falls into this category. The two main deep learning architectures used by MRC researchers are the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN).
Factoid questions are questions that can be answered with simple facts expressed in short text answers, like a personal name, temporal expression, or location. Non-factoid questions, on the other hand, have longer answers than factoid questions.
The input context can be a single passage or multiple passages. It is obvious that as the context gets longer, finding the answer becomes harder and more time-consuming. Until now, most papers have focused on a single passage [18, 19, 29, 47-50], but multi-passage MRC systems are becoming more popular [39, 45, 51, 52]. According to Table 2, only 4% of the reviewed papers focused on multiple passages in 2016, but this ratio reached 8% and 35% in 2017 and 2018, respectively.
We introduce delft, a factoid question answering system which combines the nuance and depth of knowledge graph question answering approaches with the broader coverage of free-text. delft builds a free-text knowledge graph from Wikipedia, with entities as nodes and sentences in which entities co-occur as edges. For each question, delft finds the subgraph linking question entity nodes to candidates using text sentences as edges, creating a dense and high coverage semantic graph. A novel graph neural network reasons over the free-text graph—combining evidence on the nodes via information along edge sentences—to select a final answer. Experiments on three question answering datasets show delft can answer entity-rich questions better than machine reading based models, bert-based answer ranking and memory networks. delft’s advantage comes from both the high coverage of its free-text knowledge graph—more than double that of dbpedia relations—and the novel graph neural network which reasons on the rich but noisy free-text evidence.
COMET is an adaptation framework for constructing commonsense knowledge bases from language models by training the language model on a seed set of knowledge tuples. These tuples provide COMET with the KB structure and relations that must be learned, and COMET learns to adapt the language model representations learned from pretraining to add novel nodes and edges to the seed knowledge graph.

