Skip-Gram based Models
The skip-gram model predicts the context words for a given single word. For this, the skip-gram model uses a "context window" based constraint: the window size determines how far backward and forward to look for the context words to predict. For example, if we take a single word Wi and a context window of size 4 (i.e., four context words in total), the possible outputs are W(i-2), W(i-1), W(i+1) and W(i+2). This prediction is not limited to the immediate context. Such models can also be trained for several related tasks, for example: (1) word embedding with some skippable distances [3], (2) extracting contextual semantics [1], (3) dependent context extraction [2], and (4) distributed multi-embeddings [4]. Most of these variants are obtained by making small changes to the model, its inputs, and its training instances.
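As a concrete illustration, the short Python sketch below (not taken from the cited papers; the function and variable names are ours) generates (center, context) training pairs using a window of 2 positions on each side, i.e., four context words in total, matching the example above.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for each token in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # look `window` positions backward and forward, skipping the center word itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence)[:6])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'),
#  ('quick', 'fox'), ('brown', 'the')]
```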
The skip-gram model has shown better results in production than some traditional language models. However, it requires comparatively more memory than an N-gram model to obtain better-predicted words. According to [3], skip-gram works well with a small amount of training data and represents even rare words or phrases well. [3] also enables efficient computation of word similarities through low-dimensional matrix operations.
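As a quick usage sketch (using the gensim library, which is not mentioned in the text above; parameter names follow gensim 4.x and differ in older versions), a skip-gram model with negative sampling can be trained on a toy corpus as follows.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
corpus = [
    "the bank approved the loan".split(),
    "the river bank was muddy".split(),
]

# sg=1 selects the skip-gram architecture; negative=5 enables negative sampling.
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 sg=1, negative=5, min_count=1, epochs=50, seed=1)

print(model.wv["bank"][:5])           # low-dimensional vector for "bank"
print(model.wv.most_similar("bank"))  # nearest neighbours by cosine similarity
```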
Architecture and Function of the Traditional Skip-gram Model
Unlike traditional artificial neural network (ANN) based models, the hidden layer nodes of the skip-gram model contain no activation function (such as sigmoid, tanh, ReLU, or softmax). Each hidden node simply takes the weighted sum of values from the previous layer (the input layer nodes) and passes it to the next layer (traditionally the output layer nodes). The following video tutorials contain an interactive demonstration of the architecture and function of a traditional skip-gram model.
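A minimal NumPy sketch of this point (all sizes and names are illustrative, not taken from the tutorials): the hidden layer is a plain linear projection, and for a one-hot input it simply selects one row of the input weight matrix; only the output layer applies a softmax.

```python
import numpy as np

V, N = 10, 4                      # vocabulary size, embedding (hidden layer) size
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))      # input -> hidden weights (the word embeddings)
W2 = rng.normal(size=(N, V))      # hidden -> output weights

x = np.zeros(V); x[3] = 1.0       # one-hot vector for the center word Wi
h = x @ W1                        # hidden layer: a plain sum, no activation function
u = h @ W2                        # output layer scores, one per vocabulary word
p = np.exp(u - u.max()); p /= p.sum()   # softmax over the vocabulary
print(p.round(3))                 # predicted probability of each context word
```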
Word2Vec-Skip-Gram (Part-1): This video tutorial covers the skip-gram based architecture, including an introduction to the following architectural elements: (1) forward pass, (2) error calculation and (3) back-propagation.
Word2Vec-Skip-Gram (Part-2): This tutorial covers the training and test phases of the skip-gram based architecture. The training phase consists of: (1) forward pass, (2) error calculation and (3) back-propagation.
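To make those three steps concrete, here is a NumPy sketch of one training step for a single (center, context) pair under a full-softmax objective (the published models typically use hierarchical softmax or negative sampling instead; all sizes, indices, and the learning rate are illustrative).

```python
import numpy as np

V, N, lr = 10, 4, 0.1             # vocabulary size, embedding size, learning rate
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))      # input -> hidden weights (word embeddings)
W2 = rng.normal(size=(N, V))      # hidden -> output weights

center, context = 3, 5            # indices of the center word and one context word

# (1) Forward pass: linear hidden layer, softmax output
h = W1[center]                    # row lookup, no activation
u = h @ W2
p = np.exp(u - u.max()); p /= p.sum()

# (2) Error calculation: cross-entropy loss and its gradient w.r.t. the scores
loss = -np.log(p[context])
e = p.copy(); e[context] -= 1.0   # dL/du

# (3) Back-propagation: update both weight matrices
dh = W2 @ e                       # dL/dh, computed before W2 is updated
W2 -= lr * np.outer(h, e)         # dL/dW2 = outer(h, dL/du)
W1[center] -= lr * dh             # gradient flows back through the linear hidden layer
print(round(loss, 4))
```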
Some interesting research related to Skip-gram based models
Word embedding with some skippable distances: According to [3], training instances can be created by skipping a constant number of words in the context. The authors apply the skip-gram with negative sampling model to obtain such skippable word embeddings. Here, for example, the training instance for a given word Wi may consist of Wi together with context words such as W(i-4), W(i-3), W(i+3), W(i+4) and so on (see the sketch after this list).
Extracting contextual semantics [1]: This can be understood through a familiar example, viewed in the current context of word embeddings. According to [1], all occurrences of the word "bank" will have the same embedding, irrespective of whether the context suggests "a financial institution" or "a river bank"; as a result, the word "bank" gets an embedding that is approximately the average of its different contextual semantics relating to finance or rivers. Since "bank" effectively behaves like different words in these contexts, [1] uses a tensor layer to model the interaction of words and topics and handle such embeddings.
Dependent context extraction: [2] uses a skip-gram based model with contexts derived from parser dependencies (Stanford parser dependencies) between the important tokens of the given text.
Distributed multi-embeddings: [4] builds multiple embedding vectors to represent the different meanings of a given word (especially effective for polysemous words such as "bank" or "star").
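Returning to item (1) above, here is a small illustrative sketch (not taken from [3]; the offsets and names are ours) of forming training pairs that skip the immediately adjacent words and use only context positions at larger, fixed distances.

```python
def skip_distance_pairs(tokens, offsets=(-4, -3, 3, 4)):
    """Pair each word with context words at fixed, non-adjacent offsets only."""
    pairs = []
    for i, center in enumerate(tokens):
        for d in offsets:
            j = i + d
            if 0 <= j < len(tokens):        # stay inside the sentence
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skip_distance_pairs(sentence)[:4])
# [('the', 'fox'), ('the', 'jumps'), ('quick', 'jumps'), ('quick', 'over')]
```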
References:
- [1] Liu, Pengfei, Xipeng Qiu, and Xuanjing Huang. "Learning Context-Sensitive Word Embeddings with Neural Tensor Skip-Gram Model." IJCAI, 2015.
- [2] Levy, Omer, and Yoav Goldberg. "Dependency-Based Word Embeddings." ACL, 2014.
- [3] Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. "Distributed Representations of Words and Phrases and Their Compositionality." Advances in Neural Information Processing Systems 26 (NIPS 2013), Lake Tahoe, Nevada, United States, pages 3111–3119.
- [4] Tian, Fei, et al. "A Probabilistic Model for Learning Multi-Prototype Word Embeddings." COLING, 2014.
- [5] Grover, Aditya, and Jure Leskovec. "node2vec: Scalable Feature Learning for Networks." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016.
If no activation function is used, would it not mean that the resultant weight matrix = W1 x W2? How is defining only two matrices W and W' helpful? Why not more or fewer?
Great detail there. It might be helpful if you described the meanings of the three outputs; I might have missed it, but I don't see that mentioned.