Entity linking in commerce queries
Prereq.: Research papers: foundation of industry-academic...
May be relevant:
Product entity synonym discovery
Product mention recognition in social forums
Search queries carry a large amount of information that can be organized into a knowledge base for use in downstream applications (e.g., parsing, coreference resolution, and entity linking). [1] focuses on automatically identifying rare yet useful (e.g., for online advertising) brand and product entities from a large collection of Web queries in the online shopping domain. An unsupervised approach based on adaptor grammars is proposed that requires neither human annotation nor external resources (e.g., IMDB, DBpedia). To reduce noise and normalize query patterns, a standardization step groups multiple search patterns and word orderings together and rewrites each group into its most frequent form. Three different sets of grammar rules are presented to infer query structure and extract brand and product entities.
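The standardization step can be sketched as grouping queries that share the same bag of words and mapping each group to its most frequent surface form. This is a minimal illustration of the idea, not the paper's actual procedure (which also normalizes search patterns):

```python
from collections import Counter, defaultdict

def standardize_queries(queries):
    """Group queries sharing the same word set and rewrite each group
    to its most frequent ordering (a hypothetical sketch of the
    standardization idea, not [1]'s exact algorithm)."""
    groups = defaultdict(Counter)
    for q in queries:
        key = frozenset(q.lower().split())      # ignore word order
        groups[key][q.lower()] += 1
    # canonical form = most frequent surface form within each group
    canon = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    return [canon[frozenset(q.lower().split())] for q in queries]

queries = ["nike shoes red", "red nike shoes", "nike shoes red", "ipad case"]
print(standardize_queries(queries))
# → ['nike shoes red', 'nike shoes red', 'nike shoes red', 'ipad case']
```

Collapsing word-order variants like this concentrates frequency mass on one pattern per query group, which is what makes the downstream grammar induction less noisy.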
Conventional automatic techniques use large corpora (e.g., news articles) to learn entity types (e.g., person, movie, or place). Such corpora, however, capture general knowledge about entities, which makes it difficult to satisfy users with specific, personalized needs. Query logs, which contain billions of entities, expose word patterns and click-through behavior not found in text corpora, and thus provide a complementary source for discovering entity types from user behavior. [2] tackles two challenges in this regard: (1) queries are short texts, so information related to entities is usually very sparse; and (2) search logs contain large amounts of irrelevant or noisy information. Query logs are first modeled as a bipartite graph connecting entities to their auxiliary information, such as contextual words and clicked URLs. A graph-based framework called ELP (Ensemble framework based on Label Propagation) is then proposed to learn both entity types and auxiliary signals; within ELP, two separate strategies are designed to address the sparsity and noise problems in query logs.
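The core mechanism, propagating type labels from a few seed entities across a bipartite graph of entities and auxiliary nodes, can be sketched as follows. The graph data, seed labels, and uniform edge weighting here are illustrative assumptions, not ELP's actual formulation:

```python
from collections import defaultdict

def propagate_labels(edges, seeds, iterations=10):
    """Toy label propagation over a bipartite graph.
    `edges` maps entity -> set of auxiliary nodes (context words / URLs);
    `seeds` maps a few entities to known type-label distributions.
    A simplified sketch; ELP's two ensemble strategies are not modeled."""
    adj = defaultdict(set)
    for ent, auxes in edges.items():
        for aux in auxes:
            adj[ent].add(aux)
            adj[aux].add(ent)
    labels = {n: dict(seeds.get(n, {})) for n in adj}
    for _ in range(iterations):
        new = {}
        for node in adj:
            scores = defaultdict(float)
            for nb in adj[node]:
                for lab, w in labels[nb].items():
                    scores[lab] += w / len(adj[nb])
            if node in seeds:                 # clamp seeds to known labels
                new[node] = dict(seeds[node])
            else:
                total = sum(scores.values()) or 1.0
                new[node] = {lab: w / total for lab, w in scores.items()}
        labels = new
    return labels

edges = {"iphone 7": {"buy", "case"}, "galaxy s8": {"buy", "case"},
         "harry potter": {"watch", "trailer"}, "frozen": {"watch", "trailer"}}
seeds = {"iphone 7": {"product": 1.0}, "harry potter": {"movie": 1.0}}
labels = propagate_labels(edges, seeds)
# "galaxy s8" inherits "product" via the shared context words "buy"/"case"
```

The bipartite structure is what lets sparse queries help each other: two entities never co-occurring in one query still exchange label mass through shared context words or clicked URLs.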
Identifying and disambiguating entity references in queries is a key enabler for semantic search. The challenges are the limited context a query provides, coupled with the time constraints of an online setting; both hamper the ability to understand the searcher's intent and return a relevant, focused response. Supervised methods are expected to yield high effectiveness but lower efficiency, while for unsupervised approaches it is the other way around. Entity linking typically relies on two kinds of features: (i) contextual similarity between a candidate entity and the text surrounding a mention, and (ii) the interdependence between all entity linking decisions in the text (extracted from the underlying KB). [3] strikes a balance between effectiveness and efficiency by employing supervised learning for entity ranking while tackling disambiguation with a simple unsupervised algorithm. Their experimental analysis shows that high-quality ranking results help disambiguation substantially, whereas entity interdependence contributes very little.
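The two-stage structure can be illustrated with a toy linker: a contextual-similarity score stands in for the learned ranker, and disambiguation is simply keeping the top-scoring candidate per mention. The entity names and descriptions are made up for illustration; [3]'s ranker is supervised and uses richer features:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def link_mentions(query_tokens, candidates):
    """Toy two-stage linker: score each candidate entity against the
    query context (standing in for a learned ranking model), then
    greedily keep the best candidate per mention (the simple
    unsupervised disambiguation step)."""
    result = {}
    for mention, ents in candidates.items():
        scored = [(cosine(query_tokens, desc), ent) for ent, desc in ents.items()]
        result[mention] = max(scored)[1]
    return result

query = ["new", "york", "times", "subscription"]
cands = {"new york": {
    "New York City": ["city", "usa", "manhattan"],
    "The New York Times": ["newspaper", "times", "subscription", "news"]}}
print(link_mentions(query, cands))  # picks "The New York Times"
```

The greedy per-mention step reflects the paper's finding as summarized above: once ranking is good, modeling interdependence between linking decisions adds little.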
Search engines are the closest available substitute for the world knowledge required to solve complex natural language understanding tasks. [4] piggybacks on such an engine to alleviate the noise and irregularities of query language (misspellings, unreliable tokenization, word order, and capitalization), placing queries in a larger context in which they are easier to make sense of. The key algorithmic idea is to first discover a candidate set of entities and then link those entities back to their mentions in the input query. This confines the possible concepts pertinent to the query to only those actually mentioned in it. Link-back is implemented as a collective disambiguation step based on a supervised ranking model that makes one joint prediction for the annotation of the complete query, directly optimizing the F1 measure. Both known features (e.g., semantic relatedness among entities, word embeddings) and several novel ones (e.g., an approximate distance between mentions and entities, which can handle spelling errors) are evaluated.
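An approximate mention–entity distance that tolerates spelling errors can be built from edit distance; the normalization below is a hypothetical form for illustration, and [4]'s exact feature definition may differ:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (single-row variant)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return dp[n]

def mention_entity_score(mention, entity_name):
    """Approximate-match feature: 1.0 for an exact match, decaying with
    edit distance, so a misspelled mention still scores close to 1."""
    d = edit_distance(mention.lower(), entity_name.lower())
    return 1.0 - d / max(len(mention), len(entity_name), 1)

print(mention_entity_score("iphon", "iphone"))  # close to 1 despite the typo
```

Exact string matching would score the misspelled mention zero; the soft feature lets the ranker still surface the right entity for noisy query text.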