Deep learning models for search with semantics
Prereq.: Research papers: foundation of industry-academic...
The same concept is often expressed using different vocabularies and language styles in documents and queries. Models such as latent semantic analysis (LSA) map a query to relevant documents at the semantic level where lexical matching often fails (e.g., microsoft office could allow remote code execution vs. welcome to the apartment office). However, these models view a query (or a document) as a bag of words and are therefore ineffective at modeling its contextual structure (e.g., body fat % calculator vs. auto body repair estimate). Even the context captured by models that learn a word's topic distribution from its co-occurrences within a document or sentence is too coarse-grained for the retrieval task. A latent semantic model based on a convolutional neural network (CNN) with convolution and max-pooling is proposed that instead views a query/document as a sequence of words with structure and retains maximal context in its projected representation [1].
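As a rough illustration of the idea in [1], the sketch below (PyTorch; the layer sizes, tanh activations, and three-word window are assumptions for illustration, not the paper's exact configuration) encodes a token sequence with a convolution over word windows, max-pools over positions to keep the strongest local-context signal per dimension, and scores a query-document pair by cosine similarity in the latent space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLatentSemanticModel(nn.Module):
    """Sketch in the spirit of [1]: a query/document is a sequence of word
    vectors; a convolution captures local (word n-gram) context, and
    max-pooling yields a fixed-size latent semantic vector."""
    def __init__(self, vocab_size, embed_dim=100, conv_dim=300, latent_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, conv_dim, kernel_size=3, padding=1)
        self.proj = nn.Linear(conv_dim, latent_dim)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        h = torch.tanh(self.conv(x))               # local contextual features
        h, _ = h.max(dim=2)                        # max-pool over positions
        return torch.tanh(self.proj(h))            # latent semantic vector

# Relevance as cosine similarity between query and document vectors.
model = ConvLatentSemanticModel(vocab_size=50_000)
q = torch.randint(0, 50_000, (1, 6))   # toy query token ids
d = torch.randint(0, 50_000, (1, 40))  # toy document token ids
score = F.cosine_similarity(model(q), model(d))
```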
Determining semantic similarity (i.e., whether two texts have the same meaning) is important for IR tasks (e.g., query suggestion). Existing approaches, such as handcrafted patterns, external sources of structured knowledge (e.g., WordNet, Wikipedia), and distributional semantics, cannot be assumed available in all domains and circumstances. Likewise, approaches that depend on parse trees are restricted to syntactically well-formed texts, typically one sentence in length. Word embeddings (i.e., vector representations, computed from unlabelled data, that place terms in a semantic space) are proposed instead, wherein vector proximity can be interpreted as semantic similarity [2].
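A minimal sketch of how vector proximity reads as semantic similarity: the embedding values below are made up for illustration, and in practice they would come from a model such as word2vec or GloVe trained on unlabelled text.

```python
import numpy as np

# Toy "pretrained" embeddings; the vectors are invented for illustration.
embeddings = {
    "car":   np.array([0.90, 0.10, 0.30]),
    "auto":  np.array([0.85, 0.15, 0.35]),
    "apple": np.array([0.10, 0.80, 0.20]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Vector proximity interpreted as semantic similarity [2]:
print(cosine(embeddings["car"], embeddings["auto"]))   # high  -> similar
print(cosine(embeddings["car"], embeddings["apple"]))  # lower -> dissimilar
```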
Recent advances in deep neural networks (DNNs) have demonstrated the importance of learning vector-space representations of text, e.g., of words and sentences, for many NLP tasks (e.g., tagging, NER, semantic role labeling). Since these representations usually live in a low-dimensional vector space, they yield more compact models than those built from surface-form features. However, many such methods are based on unsupervised objectives (e.g., word prediction), which do not directly optimize the desired task; others use supervision for a single task and are limited by the amount of training data available. A multi-task DNN that learns representations across multiple tasks is proposed instead [3], not only leveraging large amounts of cross-task data but also benefiting from a regularization effect that leads to more general representations, helping tasks in new domains. The approach combines the tasks of query classification and Web search ranking.
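A minimal sketch of the shared-plus-task-specific structure described in [3], assuming PyTorch; the bag-of-words inputs, layer sizes, and the cosine ranking head are simplifying assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskDNN(nn.Module):
    """Lower layers are shared across tasks; each task keeps its own head."""
    def __init__(self, input_dim=30_000, shared_dim=300, num_classes=10):
        super().__init__()
        self.shared = nn.Sequential(          # shared representation layers
            nn.Linear(input_dim, shared_dim), nn.Tanh(),
            nn.Linear(shared_dim, shared_dim), nn.Tanh(),
        )
        # task-specific head: query classification
        self.classify = nn.Linear(shared_dim, num_classes)

    def class_logits(self, query_bow):
        return self.classify(self.shared(query_bow))

    def rank_score(self, query_bow, doc_bow):
        # task-specific head: ranking via cosine in the shared semantic space
        return F.cosine_similarity(self.shared(query_bow), self.shared(doc_bow))
```

Training would alternate mini-batches between the two tasks, so the shared layers receive supervision from both; this is where the regularization effect comes from.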
IR models based on a simple bag of words, where each word denotes a distinct dimension of the semantic space, allow only exact matches, not semantically related words, to contribute to relevance scores. Recent advances in word embeddings have shown that semantic representations of words can be learned by distributional models. Building on this, a novel retrieval model is introduced that views the matching between a query and a document as a non-linear word transportation (NWT) problem. Under this formulation, the capacities (fixed for document words, non-fixed for query words) and profits of a transportation model are designed for the IR task, which in turn enables an efficient solution through pruning and indexing strategies.
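The greedy routine below is only a caricature of the NWT view, meant to show the moving parts: document words as suppliers with fixed capacities (e.g., normalized term frequencies), query words as receivers, and a non-linear profit in embedding similarity. The actual model solves a constrained optimization rather than this one-shot assignment, so treat everything here as an assumption for illustration.

```python
import numpy as np

def nwt_score(query_vecs, doc_vecs, doc_capacity):
    """Simplified word-transportation sketch: each document word ships its
    fixed capacity to the query word it matches best; the relevance score
    aggregates the shipped mass under diminishing (non-linear) returns."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

    gains = np.zeros(len(query_vecs))
    for dv, cap in zip(doc_vecs, doc_capacity):
        sims = np.array([cos(dv, qv) for qv in query_vecs])
        j = int(sims.argmax())             # ship to the best-matching query word
        gains[j] += cap * max(sims[j], 0)  # profit grows with similarity
    return float(np.sum(np.log1p(gains)))  # non-linear aggregation
```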
Understanding the query, and ultimately the user's intent, is the primary goal; once intent is captured, we can search the query and item spaces for matches that share that intent. Deep learning, especially RNNs and word embeddings, is very useful here, as in the sketch below.
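A minimal sketch, assuming PyTorch and a GRU encoder: the query is mapped to an "intent" vector and matched against precomputed item vectors by cosine similarity. The vocabulary size, dimensions, and random item vectors are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Embedding(20_000, 64)
rnn = nn.GRU(64, 128, batch_first=True)

def intent_vector(token_ids):               # (batch, seq_len)
    _, h = rnn(embed(token_ids))             # final hidden state as the intent
    return h[-1]                             # (batch, 128)

query = torch.randint(0, 20_000, (1, 5))     # toy query token ids
items = torch.randn(1000, 128)               # precomputed item vectors (toy)
scores = F.cosine_similarity(intent_vector(query), items)
best = scores.topk(10).indices               # items closest in intent space
```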
Do we see a clear path moving from syntactic to semantic to pragmatic understanding? Making sense of vector representations seems a bit difficult when building solutions; for example, suppose I want to build a document classifier that uses vector-space representations (sketched below).
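One simple way to do that, sketched below with toy embeddings and labels: average a document's word vectors and train a linear classifier (scikit-learn) on the result. Real embeddings would come from pretrained vectors, and the corpus here is invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy embeddings; in practice, load pretrained word vectors instead.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in ["good", "bad", "movie", "film"]}

def doc_vector(tokens):
    # Represent a document as the mean of its known word vectors.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

docs = [["good", "movie"], ["bad", "film"], ["good", "film"], ["bad", "movie"]]
labels = [1, 0, 1, 0]                       # toy sentiment labels
X = np.stack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([doc_vector(["good", "good", "movie"])]))
```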