Skip-Gram based Models
The skip-gram model predicts the context words for a given single word. For this, the skip-gram model uses a "context window" based constraint: the window size determines how far backward and forward to look for the context words to predict. For example, if we take a single word Wi and a context window of size 4 (i.e., four context words in total), the possible outputs are W(i-2), W(i-1), W(i+1) and W(i+2). This prediction is not limited to the immediate context. Such models can also be trained for several related tasks, for example: (1) word embedding with some skippable distances [3], (2) extracting contextual semantics [1], (3) dependent context extraction [2], and (4) distributed multi-embeddings [4]. Most of these variants are obtained by making small changes to the model, its inputs, and its training instances.
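As a concrete illustration, the short Python sketch below (not taken from the cited papers; the function and variable names are ours) generates (center, context) training pairs using a window of 2 positions on each side, i.e., four context words in total, matching the example above.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for each token in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # look `window` positions backward and forward, skipping the center word itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence)[:6])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'),
#  ('quick', 'fox'), ('brown', 'the')]
```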
The skip-gram model has shown better results in production than some traditional language models. However, it requires comparatively more memory than an N-gram model to obtain better-predicted words. According to [3], skip-gram works well with a small amount of training data and represents even rare words or phrases well. [3] also enables efficient computation of word similarities through low-dimensional matrix operations.
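As a quick usage sketch (using the gensim library, which is not mentioned in the text above; parameter names follow gensim 4.x and differ in older versions), a skip-gram model with negative sampling can be trained on a toy corpus as follows.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
corpus = [
    "the bank approved the loan".split(),
    "the river bank was muddy".split(),
]

# sg=1 selects the skip-gram architecture; negative=5 enables negative sampling.
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 sg=1, negative=5, min_count=1, epochs=50, seed=1)

print(model.wv["bank"][:5])           # low-dimensional vector for "bank"
print(model.wv.most_similar("bank"))  # nearest neighbours by cosine similarity
```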
Architecture and Function of the Traditional Skip-gram Model
Unlike traditional artificial neural network (ANN) based models, the hidden layer nodes of the skip-gram model contain no activation function (such as sigmoid, tanh, ReLU, or softmax). Each hidden node simply takes the weighted sum of values from the previous layer (the input layer nodes) and passes it to the next layer (traditionally the output layer nodes). The following video tutorials contain an interactive demonstration of the architecture and function of a traditional skip-gram model.
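A minimal NumPy sketch of this point (all sizes and names are illustrative, not taken from the tutorials): the hidden layer is a plain linear projection, and for a one-hot input it simply selects one row of the input weight matrix; only the output layer applies a softmax.

```python
import numpy as np

V, N = 10, 4                      # vocabulary size, embedding (hidden layer) size
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))      # input -> hidden weights (the word embeddings)
W2 = rng.normal(size=(N, V))      # hidden -> output weights

x = np.zeros(V); x[3] = 1.0       # one-hot vector for the center word Wi
h = x @ W1                        # hidden layer: a plain sum, no activation function
u = h @ W2                        # output layer scores, one per vocabulary word
p = np.exp(u - u.max()); p /= p.sum()   # softmax over the vocabulary
print(p.round(3))                 # predicted probability of each context word
```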
Word2Vec-Skip-Gram (Part-1): This video tutorial covers the skip-gram based architecture, including an introduction to the following architectural elements: (1) forward pass, (2) error calculation and (3) back-propagation.
Word2Vec-Skip-Gram (Part-2): This tutorial covers the training and test phases of the skip-gram based architecture. The training phase consists of: (1) forward pass, (2) error calculation and (3) back-propagation.
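To make those three steps concrete, here is a NumPy sketch of one training step for a single (center, context) pair under a full-softmax objective (the published models typically use hierarchical softmax or negative sampling instead; all sizes, indices, and the learning rate are illustrative).

```python
import numpy as np

V, N, lr = 10, 4, 0.1             # vocabulary size, embedding size, learning rate
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))      # input -> hidden weights (word embeddings)
W2 = rng.normal(size=(N, V))      # hidden -> output weights

center, context = 3, 5            # indices of the center word and one context word

# (1) Forward pass: linear hidden layer, softmax output
h = W1[center]                    # row lookup, no activation
u = h @ W2
p = np.exp(u - u.max()); p /= p.sum()

# (2) Error calculation: cross-entropy loss and its gradient w.r.t. the scores
loss = -np.log(p[context])
e = p.copy(); e[context] -= 1.0   # dL/du

# (3) Back-propagation: update both weight matrices
dh = W2 @ e                       # dL/dh, computed before W2 is updated
W2 -= lr * np.outer(h, e)         # dL/dW2 = outer(h, dL/du)
W1[center] -= lr * dh             # gradient flows back through the linear hidden layer
print(round(loss, 4))
```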
Some interesting research related to Skip-gram based models
Word embedding with some skippable distances: According to [3], training instances can be created by skipping a constant number of words in the context. The authors apply the skip-gram with negative sampling model to obtain such skippable word embeddings. Here, for example, the training instance for a given word Wi may consist of Wi together with context words such as W(i-4), W(i-3), W(i+3), W(i+4) and so on (see the sketch after this list).
Extracting contextual semantics [1]: This can be understood through a familiar example, viewed in the current context of word embeddings. According to [1], all occurrences of the word "bank" will have the same embedding, irrespective of whether the context suggests "a financial institution" or "a river bank"; as a result, the word "bank" gets an embedding that is approximately the average of its different contextual semantics relating to finance or rivers. Since "bank" effectively behaves like different words in these contexts, [1] uses a tensor layer to model the interaction of words and topics and handle such embeddings.
Dependent context extraction: [2] uses a skip-gram based model with contexts derived from parser dependencies (Stanford parser dependencies) between the important tokens of the given text.
Distributed multi-embeddings: [4] builds multiple embedding vectors to represent the different meanings of a given word (especially effective for polysemous words such as "bank" or "star").
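Returning to item (1) above, here is a small illustrative sketch (not taken from [3]; the offsets and names are ours) of forming training pairs that skip the immediately adjacent words and use only context positions at larger, fixed distances.

```python
def skip_distance_pairs(tokens, offsets=(-4, -3, 3, 4)):
    """Pair each word with context words at fixed, non-adjacent offsets only."""
    pairs = []
    for i, center in enumerate(tokens):
        for d in offsets:
            j = i + d
            if 0 <= j < len(tokens):        # stay inside the sentence
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skip_distance_pairs(sentence)[:4])
# [('the', 'fox'), ('the', 'jumps'), ('quick', 'jumps'), ('quick', 'over')]
```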
References:
- [1] Liu, Pengfei, Xipeng Qiu, and Xuanjing Huang. "Learning Context-Sensitive Word Embeddings with Neural Tensor Skip-Gram Model." IJCAI, 2015.
- [2] Levy, Omer, and Yoav Goldberg. "Dependency-Based Word Embeddings." ACL, 2014.
- [3] Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. "Distributed Representations of Words and Phrases and Their Compositionality." Advances in Neural Information Processing Systems 26 (NIPS 2013), Lake Tahoe, Nevada, United States, pages 3111–3119.
- [4] Tian, Fei, et al. "A Probabilistic Model for Learning Multi-Prototype Word Embeddings." COLING, 2014.
- [5] Grover, Aditya, and Jure Leskovec. "node2vec: Scalable Feature Learning for Networks." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016.
If no activation function is used, would it not mean that the resultant weight matrix = W1 x W2? How is defining only two matrices W and W' helpful? Why not more or fewer?
Great detail there. It might be helpful if you described the meanings of the three outputs; I might have missed it, but I don't see that mentioned.