
The Transformer was originally designed for language translation tasks.
Example:
Input: This is a pencil
Output: Đây là một cái bút chì
Before Transformer, sentence generation problems were commonly solved using RNN-based models, specifically sequence-to-sequence (seq2seq) models.
The simplest RNN translation model uses an encoder-decoder architecture consisting of two RNNs, usually LSTMs. One RNN acts as the encoder and the other as the decoder. The encoder reads the source sentence, and its final hidden state is passed as the initial hidden state of the decoder. The idea is that this final encoder state encodes all the information of the source sentence, allowing the decoder to generate the target sentence from this vector.
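This hand-off can be sketched in a few lines of pure Python. The scalar weights and token values below are made up for illustration, not trained parameters; the point is that only the encoder's final state reaches the decoder:

```python
import math

def rnn_step(x, h, w_x, w_h):
    # One vanilla RNN step: h' = tanh(w_x * x + w_h * h), scalars for clarity.
    return math.tanh(w_x * x + w_h * h)

# Hypothetical toy weights, not learned values.
W_X, W_H = 0.5, 0.8

# --- Encoder: read the whole source sentence, keep only the final state ---
source = [0.2, -0.4, 0.7]          # toy embeddings of "This is a pencil"
h = 0.0
for x in source:
    h = rnn_step(x, h, W_X, W_H)

encoder_summary = h                # single number standing in for the vector
                                   # that must summarize the entire sentence

# --- Decoder: start from the encoder's final state and unroll ---
h_dec = encoder_summary
outputs = []
for _ in range(3):                 # generate three toy target steps
    h_dec = rnn_step(0.0, h_dec, W_X, W_H)
    outputs.append(h_dec)

print(encoder_summary, outputs)
```

The bottleneck is visible here: however long the source sentence is, everything the decoder ever sees is that single `encoder_summary`.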
Attention helps the model focus on the correct words when translating a new word. At each decoding step, the model decides which parts of the source sentence are more important. Instead of compressing the entire source sentence into a single vector, the encoder provides representations for all source tokens, such as all RNN hidden states, not just the final one.
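A minimal sketch of one decoding step with attention; the encoder states and decoder state below are made-up toy vectors:

```python
import math

def softmax(scores):
    # Normalize scores into a probability distribution.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 2-d encoder hidden states, one per source token.
encoder_states = [[0.9, 0.1],   # "This"
                  [0.2, 0.8],   # "is"
                  [0.7, 0.6]]   # "pencil"

decoder_state = [0.8, 0.3]      # current decoder hidden state (made up)

# Score every encoder state against the decoder state, normalize, then
# take the weighted average: the context vector for this decoding step.
weights = softmax([dot(decoder_state, h) for h in encoder_states])
context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
           for i in range(2)]

print(weights, context)
```

Each decoding step recomputes the weights, so the model can focus on different source words for each target word.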
The architecture of the Transformer is built around one core idea: Multi-Head Attention. The model is divided into two main parts: the encoder and the decoder.
The encoder extracts the semantic meaning of the input sentence, while the decoder generates the output sentence. Models like BERT use only the encoder, whereas GPT and most other large language models use only the decoder.
Word embedding maps words into continuous vector representations. These vectors capture semantic relationships between words and serve as the input to the Transformer model.
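A toy illustration with hand-made (not learned) 3-dimensional vectors; real models learn embeddings with hundreds or thousands of dimensions:

```python
import math

# Hypothetical hand-made embeddings; real models learn these vectors.
embeddings = {
    "cat":    [0.9, 0.8, 0.1],
    "dog":    [0.8, 0.9, 0.2],
    "pencil": [0.1, 0.0, 0.9],
}

def cosine(u, v):
    # Cosine similarity: close to 1.0 means similar direction/meaning.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

# Semantically related words end up with similar vectors.
print(cosine(embeddings["cat"], embeddings["dog"]))     # high
print(cosine(embeddings["cat"], embeddings["pencil"]))  # low
```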
If positional encoding is not used, the following two sentences would be considered the same by the model:
The cute cat is so fat
The fat cat is so cute
Positional encoding conveys the position of each word in a sequence to the Transformer. Because attention by itself has no notion of word order, positional encodings are generated using sine and cosine functions and added to the word embeddings.
Positional encoding allows the model to understand the structure and order of a sentence.
Why are positional encodings added to word embeddings rather than concatenated? One reason is semantic preservation: word embeddings represent the meaning of words, and adding small positional values keeps that semantic representation largely intact. Another reason is positional insight: positional encodings provide information about word order, and adding them enriches the word embeddings with structural context without increasing dimensionality, whereas concatenation would grow the input size of every subsequent layer.
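A minimal sketch of the sine/cosine scheme from the original paper, applied to a made-up 4-dimensional embedding; note that addition keeps the input dimensionality unchanged:

```python
import math

def positional_encoding(pos, d_model):
    # Sinusoidal encoding: even dimensions use sin, odd dimensions use cos,
    # with wavelengths that grow geometrically across dimensions.
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

D = 4
# Hypothetical embedding for the word "cute" (made-up values).
embedding_cute = [0.3, 0.1, 0.7, 0.5]

# Addition keeps the vector at d_model dimensions; concatenation would not.
at_pos_1 = [e + p for e, p in zip(embedding_cute, positional_encoding(1, D))]
at_pos_4 = [e + p for e, p in zip(embedding_cute, positional_encoding(4, D))]

print(at_pos_1)
print(at_pos_4)  # same word, different position -> different input vector
```

This is exactly what distinguishes "The cute cat is so fat" from "The fat cat is so cute": the same word arrives at the model as a different vector depending on where it sits.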
Multi-Head Attention applies the attention mechanism multiple times in parallel.
Consider the word “mole” in the following examples:
American shrew mole
One mole of carbon dioxide
Take a biopsy of the mole
The meaning of the word “mole” changes depending on context.
Another example sentence is:
“A fluffy blue creature roamed the verdant forest”
The meaning vectors of words like “creature” and “verdant” change based on their surrounding words.
The process begins by forming a query, which can be thought of as asking a question such as “Are there any adjectives related to me?” This is done by multiplying the word embedding with a weight matrix Wq to form a Query vector.
By multiplying with Wq, the word “creature” is mapped from the embedding space into a Query/Key space with reduced dimensionality.
Similarly, Keys are created by multiplying word embeddings with Wk. Words like “fluffy” and “blue” are also mapped into the same Query/Key space, where related words move closer together.
The model then computes the dot product between every Query and every Key vector in the sentence; the resulting scores are scaled and passed through Softmax, which determines how strongly each pair of words is related.
Finally, Value vectors are created by multiplying word embeddings with Wv. These Value vectors are used to compute the final contextual meaning of words such as “creature.”
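The Query/Key/Value walk-through above can be sketched in pure Python. The embeddings and the Wq, Wk, Wv matrices below are made-up toy values; the structure (project, score, scale, softmax, mix Values) is the real mechanism:

```python
import math

def matvec(m, v):
    # Multiply matrix m (list of rows) by vector v.
    return [sum(r * x for r, x in zip(row, v)) for row in m]

def softmax(s):
    e = [math.exp(x - max(s)) for x in s]
    return [x / sum(e) for x in e]

# Hypothetical 4-d embeddings for "fluffy", "blue", "creature" (made up).
E = [[0.1, 0.9, 0.2, 0.0],   # fluffy
     [0.0, 0.8, 0.3, 0.1],   # blue
     [0.7, 0.1, 0.6, 0.4]]   # creature

# Toy projections mapping 4-d embeddings into a 2-d Query/Key/Value space.
Wq = [[0.5, 0.1, 0.0, 0.2], [0.0, 0.6, 0.3, 0.1]]
Wk = [[0.4, 0.2, 0.1, 0.0], [0.1, 0.5, 0.2, 0.3]]
Wv = [[0.3, 0.3, 0.1, 0.1], [0.2, 0.0, 0.4, 0.2]]

Q = [matvec(Wq, e) for e in E]
K = [matvec(Wk, e) for e in E]
V = [matvec(Wv, e) for e in E]
d_k = 2

# For "creature" (index 2): score its Query against every Key, scale by
# sqrt(d_k), softmax, then mix the Values into a context-aware vector.
scores = [sum(q * k for q, k in zip(Q[2], key)) / math.sqrt(d_k) for key in K]
weights = softmax(scores)
creature_ctx = [sum(w * v[i] for w, v in zip(weights, V)) for i in range(d_k)]

print(weights, creature_ctx)
```

The output for "creature" is now a blend of the Value vectors of its neighbors, weighted by how strongly each neighbor answered its "question."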
Multi-Head Attention consists of multiple single-head attention layers running in parallel, allowing the model to capture different types of relationships at the same time.
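A sketch of running several single-head attentions in parallel and concatenating their outputs per token; all weights below are toy values, and each head projects into its own (here 1-dimensional) subspace:

```python
import math

def matvec(m, v):
    return [sum(r * x for r, x in zip(row, v)) for row in m]

def softmax(s):
    e = [math.exp(x - max(s)) for x in s]
    return [x / sum(e) for x in e]

def attention_head(E, Wq, Wk, Wv):
    # One single-head attention pass over all tokens in E.
    Q = [matvec(Wq, e) for e in E]
    K = [matvec(Wk, e) for e in E]
    V = [matvec(Wv, e) for e in E]
    d_k = len(Q[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                     for k in K])
        out.append([sum(wi * v[i] for wi, v in zip(w, V))
                    for i in range(d_k)])
    return out

# Toy 2-token, 2-d input and two heads with different made-up weights:
# each head can specialize in a different kind of relationship.
E = [[0.5, 0.1], [0.2, 0.9]]
heads = [
    ([[0.3, 0.1]], [[0.2, 0.4]], [[0.5, 0.0]]),   # head 1: 2-d -> 1-d
    ([[0.1, 0.6]], [[0.7, 0.1]], [[0.0, 0.5]]),   # head 2: 2-d -> 1-d
]

# Run every head independently, then concatenate their outputs per token.
per_head = [attention_head(E, *w) for w in heads]
multi_head = [sum((h[t] for h in per_head), []) for t in range(len(E))]

print(multi_head)  # one concatenated vector per token
```

In a real Transformer the concatenated heads are passed through one more output projection, omitted here for brevity.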
Softmax is used in various machine learning tasks, especially multi-class classification. It converts a vector of numbers into a probability distribution. In attention mechanisms, Softmax is used to normalize attention scores so they can be interpreted as probabilities.
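A minimal Softmax in Python, using the standard max-subtraction trick for numerical stability:

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating so exp() cannot overflow,
    # then normalize so the outputs sum to 1.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw attention scores (arbitrary example values) become probabilities:
probs = softmax([2.0, 1.0, 0.1])
print(probs)       # each value is in (0, 1), larger scores get larger shares
print(sum(probs))  # sums to 1.0
```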
The diagram shows how parameters are distributed in GPT-3. Approximately 57 billion parameters come from Multi-Head Attention, while most of the remaining parameters come from the Feed-Forward layers.
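These numbers can be sanity-checked with back-of-the-envelope arithmetic, assuming GPT-3's published configuration (96 layers, model width 12288, feed-forward width 4x the model width); biases and the embedding table are ignored:

```python
# Rough parameter count for GPT-3's attention vs feed-forward layers,
# assuming 96 layers and d_model = 12288 (published configuration).
n_layers, d_model = 96, 12288

# Attention per layer: four d_model x d_model projections (Wq, Wk, Wv, Wo).
attention_params = n_layers * 4 * d_model * d_model

# Feed-forward per layer: two projections between d_model and 4 * d_model.
ffn_params = n_layers * 2 * d_model * (4 * d_model)

print(round(attention_params / 1e9))  # ~58 billion, close to the ~57B figure
print(round(ffn_params / 1e9))        # ~116 billion
```

The estimate matches the diagram: roughly a third of the weights sit in attention, and the feed-forward layers hold most of the rest.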
By learning from neighboring words, the Transformer model can understand sentence structure and contextual meaning. With more parameters and optimized training and inference processes, Transformers demonstrate strong potential in advancing artificial general intelligence.
Strengths of the Transformer include the ability to process many words in parallel, learn from unlabeled data, and understand long texts; handling long-range context is a known weakness of RNN-based models.
References:
Attention Is All You Need (Vaswani et al., 2017)
Wikidocs.net
3Blue1Brown YouTube video on Transformers
Medium: Demystifying Transformer Architecture – The Magic of Positional Encoding
Nguyen Quang Huy
Without dreams to chase, life becomes mundane.