Generating outputs in Transformers

Previous articles in this series have shown that words in human language must be converted into vectors of numbers before artificial neural networks can interpret them. Because Transformers process their inputs in parallel, they also need information about where each word is located in a sequence and how it relates to the words that came before it. Together, that information conveys the meaning of a sequence of words, in other words, the context for those words. A neural network learns that context, not the words themselves. The goal of training Transformer networks is to feed them a large variety of contexts so that they can generalize the rules of inference that humans apply to the word sequences they perceive. Humans, however, receive more than words from their environment: other signals, such as warmth and cold or brightness and darkness, can enrich the meaning of a sequence of words under various conditions. In this analysis, we focus on the context conveyed by text alone, as when reading a book.

Figure 1. Artificial neural network

As the previous article showed, the output vector of word preprocessing, dubbed ANN Inputs, is fed to a neural network (Figure 1. Artificial neural network) that processes an encoded word within a given context to predict an output word. The most commonly used output encoding is one-hot: each vocabulary word corresponds to one output, and the output with the largest value selects the predicted word. Other encoding methods are possible but tend to be harder to train, although they may be worth the effort when the vocabulary is extensive. With one-hot encoding, the network may not need deep learning at all, because a single fully connected layer may be sufficient.
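A minimal sketch of that selection step, assuming NumPy, a toy seven-word vocabulary, and hypothetical score values: the network emits one score per vocabulary entry, Softmax turns the scores into probabilities, and the index of the largest value picks the predicted word.

```python
import numpy as np

vocab = ["the", "flower", "smells", "great", "sniff", "it", "<EOS>"]

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([0.1, 0.3, 0.2, 0.4, 2.5, 0.6, 0.1])  # hypothetical scores
probs = softmax(logits)
predicted = vocab[int(np.argmax(probs))]   # largest value picks the word
print(predicted)                           # -> "sniff"
```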

A Transformer network belongs to the class of generative networks, which means it is not limited to mapping an input onto an output class of concepts: it generates outputs based on the received input, which is incrementally extended by the outputs already generated. It keeps enriching its input until it generates a special token that ends the process. The richer the variety of training contexts, and the better a Transformer can distinguish them, the more coherent the output sequences. Unfortunately, the ability to distinguish subtle differences between contexts comes at the price of the size of a Transformer, in other words, the number of trainable parameters, which translates into the dimensionality of the network space explained in the second article, “Encoding Relationships between Words”.
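The incremental extension just described is, at its core, a simple loop. A minimal sketch, where `model` stands for a hypothetical function that returns the next token given the sequence so far (not a specific library API):

```python
EOS = "<EOS>"

def generate(model, prompt_tokens, max_steps=50):
    seq = list(prompt_tokens)        # the input, incrementally extended
    for _ in range(max_steps):
        next_token = model(seq)      # predict the next word from the context
        seq.append(next_token)       # enrich the input with the new output
        if next_token == EOS:        # the special token ends the process
            break
    return seq
```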

When processing a prompt, a Transformer is expected to keep predicting the consecutive words of that prompt until it encounters the EOS (End of Sequence) token, which is the last element of that prompt. It is important that the network can predict the words in a prompt, or their synonyms, because doing so shows that it recognizes the context and proves its ability to properly extend it. Once the input EOS token is reached, the network starts predicting words that extend the input context but were not provided as input. Here we can see the importance of proper word embedding, where synonyms have the highest similarities and antonyms the lowest. Since embeddings are trained on sub-word tokens rather than whole words, achieving this property requires large amounts of well-chosen text and an iterative process of trial and error.
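That similarity property is usually measured with cosine similarity. A minimal sketch, using hypothetical three-dimensional embeddings (real embeddings have hundreds of dimensions): a synonym pair scores near 1, while an unrelated or opposite word scores much lower.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

smell  = np.array([0.9, 0.2, 0.1])    # hypothetical embedding
sniff  = np.array([0.8, 0.3, 0.1])    # synonym: nearly parallel vector
ignore = np.array([-0.7, 0.1, 0.4])   # unrelated word: points away

print(cosine(smell, sniff))    # close to 1.0
print(cosine(smell, ignore))   # much lower, here negative
```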

Let’s assume the input and output sequence is as follows:

“The flower that grew up in a pot smells great<EOS>Sniff it<EOS>”.

The first EOS token, which is provided with the prompt, tells the network to start generating output words, i.e. to extend the input context; the second EOS token, which the network generates itself, means that the network has finished generating the output sequence. The first part is received as a prompt, and the second is the trained ability of the network to extend the first part. This implies that training should cover a wide range of context scopes, so that the network can pick the right context based on a synonymous fragment of it.
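A minimal sketch of how this example sequence splits into the prompt the network receives and the continuation it is expected to generate. Tokenizing into whole words here is a simplification; as noted above, real Transformers operate on sub-word tokens.

```python
EOS = "<EOS>"
tokens = ["The", "flower", "that", "grew", "up", "in", "a", "pot",
          "smells", "great", EOS, "Sniff", "it", EOS]

cut = tokens.index(EOS) + 1            # the first EOS closes the prompt
prompt, continuation = tokens[:cut], tokens[cut:]
print(prompt)        # ends with the first <EOS>
print(continuation)  # ['Sniff', 'it', '<EOS>'], produced by the network
```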

Figure 2. Generating output “sniff”

Once an output word is predicted and it is not the EOS token, it is appended to the input prompt for further processing, as shown in Figure 2. Generating output “sniff”. In this case, the predicted word is “sniff”, which is appended to the input sequence so that masked self-attention can be computed over the sequence including the new word.
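A minimal sketch, assuming NumPy, of the causal mask behind masked self-attention: each row lets a position attend only to itself and earlier positions, so appending “sniff” simply adds one more row and column to the mask.

```python
import numpy as np

def causal_mask(n):
    # 0 where attention is allowed, -inf where a position would peek ahead
    return np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)

print(causal_mask(3))  # mask before the new word is appended
print(causal_mask(4))  # one row and column larger after appending "sniff"
```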

Figure 3. Generating output “it”

As a result, the network predicts the word “it”, which is then appended to the input sequence in the same way as the word “sniff”, as shown in Figure 3. Generating output “it”.

Figure 4. Generating output EOS

Finally, the network predicts the EOS token, as shown in Figure 4. Generating output EOS, which ends the process of generating output words. As we can see, unlike a classification neural network, which predicts the class of concepts described by an input vector, a Transformer network generates a sequence representing a context that it predicts from the input sequence describing a fragment of that context. So, despite the common intuition, the behavior of the two types of networks is in general very similar: instead of a class of concepts, the result is a sequence of concepts, e.g. words, genes, etc. That similarity suggests we should expect comparable kinds of problems arising from the dimensionality of a network and the quality of its training. Poor separation of classes may scramble them when an input lies close to the border between classes. The expected result is misclassification, which in the case of Transformers means an incorrect output context, wrong words, or even an empty output sequence.

A reader can easily notice that the attention module is itself an expandable neural network: an artificial neuron produces a similarity value, i.e. a dot product between its weights (the Key) and an input vector (the Query), which is then passed through a nonlinear activation function, in this case Softmax. However, since the Key is not a vector of weights per se, training is performed on a Key matrix that produces the Key when multiplied by the word embedding vector; this will be elaborated in the following article.
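A minimal sketch of that view, assuming NumPy and toy dimensions: the Queries and Keys come from multiplying the embedding matrix by trained projection matrices, their dot products play the role of neuron activations, and Softmax is the nonlinearity applied across them. The scaling by the square root of the dimension is the standard stabilization used in Transformer attention.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                       # 4 tokens, embedding size 8 (toy values)
X = rng.normal(size=(n, d))       # word embedding vectors
W_Q = rng.normal(size=(d, d))     # trained Query projection matrix
W_K = rng.normal(size=(d, d))     # trained Key projection matrix

Q, K = X @ W_Q, X @ W_K
scores = Q @ K.T / np.sqrt(d)     # dot products between Queries and Keys
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise Softmax
print(weights.shape)              # (4, 4) attention weights
```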

Thank you for reading our newsletter and stay tuned for more updates from Vault Security!

