Transformer Chronicles: Cracking the Code - The Secrets of Encoder-Decoder Architecture
In the previous article we laid the foundations of sequence processing with RNNs and CNNs and highlighted some of their shortcomings. In this article we dive into an aspect mentioned there: the encoder-decoder architecture.
This architecture is central to sequence-to-sequence processing, which covers tasks such as translation, summarization, and speech recognition. In brief, the encoder compresses the input into a representation, and the decoder takes that representation and produces the output from it.
Encoder
The idea behind an encoder is to capture the essential features of a given sequence. Instead of memorizing the input, the encoder compresses it into a representational vector, called a latent vector.
This vector is fixed in size and represents the key features of the input, which is one of the defining attributes of the encoder-decoder architecture. Sequences vary in length: a sentence may be three words or ten, and an input may be three sentences or hundreds. Whatever the length, the encoder represents the input as a fixed-size vector, which is how variable-length inputs are handled.
For an RNN-based encoder, the final hidden state at the end of the input sequence serves as the latent vector, as sketched below; for a CNN-based encoder, it is the final output after convolutions, activations, and pooling.
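To make this concrete, here is a minimal sketch of an RNN-based encoder using PyTorch. The class name, vocabulary size, and dimensions are illustrative assumptions, not taken from any particular library: the point is that the final GRU hidden state is returned as the latent vector, and its size stays the same whether the input has 3 tokens or 50.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of an RNN-based encoder whose final hidden state is the latent vector."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, hidden = self.rnn(embedded)         # hidden: (1, batch, hidden_dim)
        return hidden.squeeze(0)               # latent vector: (batch, hidden_dim)

encoder = Encoder(vocab_size=10_000, embed_dim=128, hidden_dim=256)
short_input = torch.randint(0, 10_000, (1, 3))    # a 3-token sequence
long_input = torch.randint(0, 10_000, (1, 50))    # a 50-token sequence
print(encoder(short_input).shape)  # torch.Size([1, 256])
print(encoder(long_input).shape)   # torch.Size([1, 256]) -- same size regardless of input length
```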
Latent Vector
A latent vector is a fixed-size, typically high-dimensional vector that represents an input, independent of the length of the input sequence. This representation captures the input's key features and is crucial for the encoder-decoder architecture because it allows variable-length inputs. It also keeps processing simple and reduces the computational load. Think of it as a fixed-length compressed summary of the input, much like the abstract of an article.
However, it introduces an information bottleneck: reducing an entire sequence to a small fixed-size representation can lose important information, especially for longer sequences. We shall look at how the transformer handles this challenge.
Decoder
The decoder generates the output one step at a time using the latent vector produced by the encoder. In the first step, the decoder takes this context vector as its initial hidden state. The following steps are autoregressive: the decoder conditions on its previous hidden state and the token it generated at the previous step. During training, however, the previously generated token is sometimes replaced by the actual correct previous token; this is known as teacher forcing. After each generation step, the hidden state is updated accordingly, as in the sketch below.
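Here is a minimal sketch of such a decoding loop, continuing the PyTorch example above. The names (Decoder, decode, sos_id) are illustrative assumptions: the latent vector seeds the hidden state, and each step consumes either the ground-truth previous token (teacher forcing) or the decoder's own previous prediction.

```python
class Decoder(nn.Module):
    """Sketch of an RNN-based decoder that generates one token per step."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_token, hidden):
        embedded = self.embedding(prev_token).unsqueeze(1)  # (batch, 1, embed_dim)
        output, hidden = self.rnn(embedded, hidden)         # hidden state updated each step
        logits = self.out(output.squeeze(1))                # (batch, vocab_size)
        return logits, hidden

def decode(decoder, latent, targets, teacher_forcing=True, sos_id=1):
    hidden = latent.unsqueeze(0)  # encoder's latent vector becomes the initial hidden state
    prev_token = torch.full((latent.size(0),), sos_id, dtype=torch.long)
    all_logits = []
    for t in range(targets.size(1)):
        logits, hidden = decoder.step(prev_token, hidden)
        all_logits.append(logits)
        # teacher forcing: feed the true token; otherwise feed the model's own prediction
        prev_token = targets[:, t] if teacher_forcing else logits.argmax(dim=-1)
    return torch.stack(all_logits, dim=1)  # (batch, target_len, vocab_size)
```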
The encoder and decoder are separate networks connected by the shared latent vector: the final hidden state of the encoder becomes the initial hidden state of the decoder. When the encoder and decoder differ in hidden dimension, the encoder's final hidden state cannot be used directly as the decoder's initial hidden state. In that case, a transformation is introduced by means of a fully connected dense layer that projects the hidden state from the encoder's dimension to the decoder's.
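As a sketch of that bridging step (the dimensions here are made up for illustration), a single linear layer is enough to map the encoder's final hidden state into the decoder's hidden size:

```python
# If the encoder and decoder hidden sizes differ, a dense layer projects
# the encoder's final hidden state into the decoder's dimension.
encoder_hidden_dim, decoder_hidden_dim = 256, 512
bridge = nn.Linear(encoder_hidden_dim, decoder_hidden_dim)

latent = torch.randn(1, encoder_hidden_dim)   # encoder's final hidden state
decoder_init = torch.tanh(bridge(latent))     # (1, decoder_hidden_dim): decoder's initial hidden state
```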
Challenges with the traditional encoder-decoder architecture
The traditional encoder-decoder model faces the following challenges:
The challenges of the encoder-decoder architecture are inherited from the underlying RNN (or CNN, if used), so the limitation of sequential processing remains unsolved for RNN-based encoder-decoder models. The vanishing gradient problem does not vanish either; it persists, and strongly so. The lack of bidirectional context limits a complete understanding of the data. But the biggest bottleneck is introduced by the latent vector: powerful as it is at representing input of any length in a fixed size, it struggles to retain all the nuanced details, especially for longer sequences.
To overcome this, we will next explore an idea that was, for a long time, added on top of the RNN-based encoder-decoder architecture, but which in "Attention Is All You Need" completely replaces the RNN and does away with recurrence. Join me next in the Transformer Chronicles as we turn our attention to this groundbreaking advancement: the attention mechanism.