Transformer Chronicles: Cracking the Code - The Secrets of Encoder-Decoder Architecture
In the previous article we laid the foundations of sequence processing with RNNs and CNNs and highlighted some of their shortcomings. In this article we dive into an aspect mentioned there: the encoder-decoder architecture.
This architecture is central to sequence-to-sequence processing, which covers tasks such as translation, summarization, and speech recognition. In brief, the encoder compresses the input into a representation, and the decoder takes that representation and produces the output from it.
Encoder
The idea behind an encoder is to capture the essential features of a given sequence. Instead of memorizing the input, the encoder compresses it into a representational vector, called a latent vector.
This vector is fixed in size and represents the key features of the input, which is one of the defining attributes of the encoder-decoder architecture. Sequences vary in length: a sentence may be three words or ten, and an input may be three sentences or hundreds. Whatever the length, the encoder represents the input as a fixed-size vector, which is how variable-length inputs are handled.
For an RNN-based encoder, the final hidden state at the end of the input sequence serves as the latent vector, as sketched below; for a CNN-based encoder, it is the final output after convolutions, activations, and pooling.
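To make this concrete, here is a minimal sketch of an RNN-based encoder using PyTorch. The class name, vocabulary size, and dimensions are illustrative assumptions, not taken from any particular library: the point is that the final GRU hidden state is returned as the latent vector, and its size stays the same whether the input has 3 tokens or 50.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of an RNN-based encoder whose final hidden state is the latent vector."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, hidden = self.rnn(embedded)         # hidden: (1, batch, hidden_dim)
        return hidden.squeeze(0)               # latent vector: (batch, hidden_dim)

encoder = Encoder(vocab_size=10_000, embed_dim=128, hidden_dim=256)
short_input = torch.randint(0, 10_000, (1, 3))    # a 3-token sequence
long_input = torch.randint(0, 10_000, (1, 50))    # a 50-token sequence
print(encoder(short_input).shape)  # torch.Size([1, 256])
print(encoder(long_input).shape)   # torch.Size([1, 256]) -- same size regardless of input length
```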
Latent Vector
A latent vector is a fixed-size, typically high-dimensional vector that represents an input, independent of the length of the input sequence. This representation captures the input's key features and is crucial for the encoder-decoder architecture because it allows variable-length inputs. It also keeps processing simple and reduces the computational load. Think of it as a fixed-length compressed summary of the input, much like the abstract of an article.
However, it introduces an information bottleneck: reducing an entire sequence to a small fixed-size representation can lose important information, especially for longer sequences. We shall look at how the transformer handles this challenge.
Decoder
The decoder generates the output one step at a time using the latent vector produced by the encoder. In the first step, the decoder takes this context vector as its initial hidden state. The following steps are autoregressive: the decoder conditions on its previous hidden state and the token it generated at the previous step. During training, however, the previously generated token is sometimes replaced by the actual correct previous token; this is known as teacher forcing. After each generation step, the hidden state is updated accordingly, as in the sketch below.
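Here is a minimal sketch of such a decoding loop, continuing the PyTorch example above. The names (Decoder, decode, sos_id) are illustrative assumptions: the latent vector seeds the hidden state, and each step consumes either the ground-truth previous token (teacher forcing) or the decoder's own previous prediction.

```python
class Decoder(nn.Module):
    """Sketch of an RNN-based decoder that generates one token per step."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_token, hidden):
        embedded = self.embedding(prev_token).unsqueeze(1)  # (batch, 1, embed_dim)
        output, hidden = self.rnn(embedded, hidden)         # hidden state updated each step
        logits = self.out(output.squeeze(1))                # (batch, vocab_size)
        return logits, hidden

def decode(decoder, latent, targets, teacher_forcing=True, sos_id=1):
    hidden = latent.unsqueeze(0)  # encoder's latent vector becomes the initial hidden state
    prev_token = torch.full((latent.size(0),), sos_id, dtype=torch.long)
    all_logits = []
    for t in range(targets.size(1)):
        logits, hidden = decoder.step(prev_token, hidden)
        all_logits.append(logits)
        # teacher forcing: feed the true token; otherwise feed the model's own prediction
        prev_token = targets[:, t] if teacher_forcing else logits.argmax(dim=-1)
    return torch.stack(all_logits, dim=1)  # (batch, target_len, vocab_size)
```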
The encoder and decoder are separate networks connected by the shared latent vector: the final hidden state of the encoder becomes the initial hidden state of the decoder. When the encoder and decoder differ in hidden dimension, the encoder's final hidden state cannot be used directly as the decoder's initial hidden state. In that case, a transformation is introduced by means of a fully connected dense layer that projects the hidden state from the encoder's dimension to the decoder's.
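As a sketch of that bridging step (the dimensions here are made up for illustration), a single linear layer is enough to map the encoder's final hidden state into the decoder's hidden size:

```python
# If the encoder and decoder hidden sizes differ, a dense layer projects
# the encoder's final hidden state into the decoder's dimension.
encoder_hidden_dim, decoder_hidden_dim = 256, 512
bridge = nn.Linear(encoder_hidden_dim, decoder_hidden_dim)

latent = torch.randn(1, encoder_hidden_dim)   # encoder's final hidden state
decoder_init = torch.tanh(bridge(latent))     # (1, decoder_hidden_dim): decoder's initial hidden state
```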
Challenges with the traditional encoder-decoder architecture
The traditional encoder-decoder model faces the following challenges:
The challenges of the encoder-decoder architecture are inherited from the underlying RNN (or CNN, if used), so the limitation of sequential processing remains unsolved for RNN-based encoder-decoder models. The vanishing gradient problem does not vanish either; it persists, and strongly so. The lack of bidirectional context limits a complete understanding of the data. But the biggest bottleneck is introduced by the latent vector: powerful as it is at representing input of any length in a fixed size, it struggles to retain all the nuanced details, especially for longer sequences.
To overcome this, we will next explore an idea that was, for a long time, added on top of the RNN-based encoder-decoder architecture, but which in "Attention Is All You Need" completely replaces the RNN and does away with recurrence. Join me next in the Transformer Chronicles as we turn our attention to this groundbreaking advancement: the attention mechanism.