Explaining "Attention Is All You Need" for babies | Part 01

RAHULRAJ P V
10 min read · Dec 9, 2023

“Attention Is All You Need” is a groundbreaking research paper in the field of artificial intelligence, particularly focusing on natural language processing (NLP). It introduced the Transformer model, which revolutionized how machines understand and generate language.

I’ll walk through each step with a simple example and a bit of the math involved. Let’s imagine we want to translate the English sentence:

“I like cats” into French.

Now, let’s dive deeper into each step, focusing on the math and the mechanics of how a Transformer model processes an input sequence. For our example, we’ll stick with the English sentence “I like cats” that we want to translate into French.

1. Inputs:

These are the words of our English sentence. Each word is an input to the model.

Example: “I”, “like”, “cats”.

2. Input Embedding:

In the Transformer model, each input word is converted into a high-dimensional vector. This vector captures the semantic meaning of the word within the context of the entire language the model has been trained on. This process is done through an embedding layer which is a trainable component of the Transformer.

The embedding layer is essentially a lookup table. If we have a vocabulary of size V (where V is the total number of unique words the model knows), and we want our embeddings to be of dimension d, then the embedding matrix E will have a shape of V × d.

Each word in our vocabulary is assigned a unique index. For example, in a simple model:

  • “I” might be index 1
  • “like” might be index 2
  • “cats” might be index 3

If w is the index of a word in the vocabulary, then the embedding v of that word is the w-th row of the embedding matrix E.

So, the math for obtaining the embedding of a word can be represented as:

v = E[w, :]

​This means we are grabbing the w-th row from matrix E. For example:

Let’s assume our embedding dimension d is 512 (which is common in many Transformer models). When we say “I” might become [0.2, −1.3, 0.9, …, 0.5], this vector is actually 512 numbers long, but for simplicity, we’re just showing a few of them here. The same goes for “like” and “cats”.

  1. For “I”, which has an index of 1, we take the first row of the embedding matrix E. This row is a vector of 512 numbers. Let’s say the first three numbers are [0.2, -1.3, 0.9], the rest would follow to complete the vector.
  2. For “like”, with an index of 2, we take the second row of E, which might start with [2.1, 0.1, 1.7], and so on for the rest of the 512 numbers.
  3. For “cats”, with an index of 3, we take the third row of E, and this vector might start with [0.5, 2.2, -0.7], continuing with 509 more numbers to complete the embedding.

Each number in these vectors is initialized randomly and then gradually adjusted through the training process as the model learns which numbers help it to best predict the next word in a sentence or the correct translation of a sentence.
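To make the lookup concrete, here is a minimal NumPy sketch of what an embedding layer does. The vocabulary, indices, and random values are made up purely for illustration; in a real model the matrix E is learned during training and the vocabulary is far larger.

```python
import numpy as np

# Hypothetical toy vocabulary: each word gets a unique index.
vocab = {"I": 1, "like": 2, "cats": 3}

V = 4      # vocabulary size (index 0 reserved, e.g., for padding)
d = 512    # embedding dimension, as in the original Transformer

# The embedding matrix E has shape V x d. In a real model these values
# start out random and are adjusted during training.
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))

def embed(word: str) -> np.ndarray:
    """Return the w-th row of E, i.e., the embedding of `word`."""
    w = vocab[word]
    return E[w]  # shape: (d,)

sentence = ["I", "like", "cats"]
X = np.stack([embed(w) for w in sentence])
print(X.shape)  # (3, 512): one 512-dimensional vector per word
```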

3. Positional Encoding :

Positional Encoding is an important component in the Transformer model architecture. It provides the model with information about the order of words or the position of words within a sequence. Since the Transformer model does not inherently process data sequentially (like RNNs or LSTMs), it does not automatically account for the position of each word. Positional Encodings are added to the embedding vectors to give the model this positional awareness.

The original Transformer paper uses sine and cosine functions of different frequencies to compute the positional encodings. The intuition is that these functions provide a unique signal for each position and allow the model to easily learn to attend by relative positions, since for any fixed offset k, sin(θ+k) and cos(θ+k) can be written as linear combinations of sin(θ) and cos(θ).

Here is the formula used for positional encoding in the Transformer:

For each position pos in the sequence (0-indexed) and each dimension index i in the embedding space (also 0-indexed), the positional encoding PE is defined as:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where:

  • d is the dimension of the embeddings (the same as the dimension of the word vectors).
  • pos: This is the position of a word in the sentence. For example, in the sentence “I like cats”, “I” is at position 0, “like” is at position 1, and “cats” is at position 2.
  • i: This is an index to the dimensions of the word’s vector in the embedding space. If the embedding size is 512, i ranges from 0 to 511.
  • sin and cos functions: These trigonometric functions are periodic, and their values repeat after a certain interval. This property is used to encode the position in a way that the model can distinguish between different positions.
  • 10000^(2i/d): This term appears in the denominator and decreases the frequency of the sine and cosine functions as i increases. This helps the model to learn to attend by relative positions, because the “wavelength” of the sine/cosine waves is shorter for the lower dimensions and longer for the higher dimensions.

Now, let’s go through a detailed example of how positional encoding is calculated and applied to an embedding vector. We’ll use the word “like” from the sentence “I like cats” and we’ll assume our embedding size is 6 for simplicity (real models use larger embeddings, like 512 dimensions).

Let’s say the word “like” has already been turned into a 6-dimensional embedding vector based on its meaning (the exact numbers are not important for this example).

Now, we need to add positional encoding to this vector to give the model information about where “like” is positioned in the sentence. Here, d is 6 in this example, and “like” sits at position 1. The positional encodings for a 6-dimensional embedding are calculated as follows.

Now, let’s calculate the positional encodings for the first position (pos = 0):

  • For i = 0 (dimensions 0 and 1): PE(0, 0) = sin(0 / 10000^(0/6)) = sin(0) = 0 and PE(0, 1) = cos(0 / 10000^(0/6)) = cos(0) = 1
  • For i = 1 (dimensions 2 and 3): PE(0, 2) = sin(0 / 10000^(2/6)) = 0 and PE(0, 3) = cos(0 / 10000^(2/6)) = 1

and this pattern will continue for i = 2, giving us zeros for sine and ones for cosine, because sin(0) is always 0 and cos(0) is always 1, regardless of i.

Now let’s calculate the positional encodings for the second position (pos = 1):

  • For i = 0 (and thus 2i = 0, 2i+1 = 1): PE(1, 0) = sin(1 / 10000^(0/6)) = sin(1) ≈ 0.84 and PE(1, 1) = cos(1) ≈ 0.54
  • For i = 1 (and thus 2i = 2, 2i+1 = 3): PE(1, 2) = sin(1 / 10000^(2/6)) = sin(1 / 21.5) ≈ 0.046 and PE(1, 3) = cos(1 / 21.5) ≈ 0.999

… and so on for the rest of the dimensions. If we were to calculate these values and round them for simplicity, we might get vectors like:

  • Positional encoding for pos = 0: [0, 1, 0, 1, 0, 1]
  • Positional encoding for pos = 1: [0.84, 0.54, …, …] — the rest of the values would be calculated using the sine and cosine with the decay factor applied.

The positional encoding vectors would be added to the respective word embeddings to give the model information about the position of the word in the sentence. Each word’s final vector would then represent not just its own meaning but also its position in the sequence, which is important for understanding the sentence structure and grammar.
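As a quick sanity check, here is a small NumPy sketch (assuming d = 6, as in the example above) that computes these sinusoidal encodings and reproduces the values for pos = 0 and pos = 1:

```python
import numpy as np

def positional_encoding(max_pos: int, d: int) -> np.ndarray:
    """Sinusoidal positional encodings, shape (max_pos, d)."""
    PE = np.zeros((max_pos, d))
    for pos in range(max_pos):
        for i in range(d // 2):
            angle = pos / (10000 ** (2 * i / d))
            PE[pos, 2 * i] = np.sin(angle)      # even dimensions use sine
            PE[pos, 2 * i + 1] = np.cos(angle)  # odd dimensions use cosine
    return PE

PE = positional_encoding(max_pos=3, d=6)
print(np.round(PE[0], 3))  # [0. 1. 0. 1. 0. 1.]
print(np.round(PE[1], 3))  # approx [0.841 0.54 0.046 0.999 0.002 1.]

# The encoding is simply added to the word embedding:
# x_with_position = embedding + PE[pos]
```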

4. Multi-Head Attention

This is where things get interesting.

The multi-head attention mechanism allows the model to focus on different parts of the input sequence as it processes each word.

Multi-Head Attention is a mechanism in the Transformer model that allows the model to jointly attend to information from different representation subspaces at different positions. It’s like being able to simultaneously focus on different parts of a sentence to understand its meaning better.

Before diving into Multi-Head Attention, it’s important to understand the basic concept of attention in this context. Attention mechanisms allow a model to focus on different parts of the input sequence when producing a particular part of the output sequence. In a sense, the model “pays attention” to relevant information and “ignores” irrelevant details.

The attention mechanism the Transformer uses is called “Scaled Dot-Product Attention”. The scaling factor is used to prevent the dot products from growing too large in magnitude, which would push the SoftMax into regions where its gradients are extremely small and slow down training.

Here’s the Scaled Dot-Product Attention formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Where:

  • Q is a matrix that contains the query (representing the word(s) to be focused on).
  • K is a matrix that contains the keys (representing all the words in the sequence).
  • V is a matrix that contains the values (also representing all the words in the sequence).
  • d_k is the dimension of the keys (and queries); its square root is used to scale the dot products.

The SoftMax function is applied to the result of the scaled dot product of queries and keys, which gives us a probability distribution. This distribution is used to weigh the values.
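Here is a minimal NumPy sketch of Scaled Dot-Product Attention following this formula; the shapes and random values are purely illustrative:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to every key
    weights = softmax(scores)         # one probability distribution per query
    return weights @ V                # weighted sum of the values

# Toy example: 3 words ("I", "like", "cats"), d_k = 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```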

Now, instead of performing the attention function once, the Transformer does it multiple times in parallel — these are the “heads” in Multi-Head Attention. Each head learns different things; one might focus on the subject of the sentence, another on the tense of the verb, etc.

Here’s how Multi-Head Attention works in more detail:

  1. Linear projections: The input vectors (embeddings of words) are linearly transformed into queries, keys, and values, with a separate set of projections for each head.
  2. Scaled Dot-Product Attention: Each head performs the attention operation on its respective queries, keys, and values.
  3. Concatenation: The outputs of each head are concatenated into a single matrix.
  4. Final Linear Projection: The concatenated matrix is once again linearly transformed.

Let’s use an example with the sentence “The cat sat on the mat”. Imagine we’re computing an attention-based representation of the word “the”.

  1. We create queries, keys, and values for each word by multiplying the embedding by three weight matrices that we learn during training. If we have 2 heads, we do this twice, ending up with two sets of Qs, Ks, and Vs.
  2. For each head, we calculate the dot product of the query for “the” with all the keys, which gives us scores.
  3. We scale the scores by dividing by √d_k (let’s say d_k is 64, so we divide by 8), which helps stabilize the gradients during training.
  4. We apply SoftMax to the scaled scores, turning them into probabilities. This tells us how much focus to put on each word for each head.
  5. Each set of scores is used to create a weighted sum of the values. Each head now has an output that is a different representation of “the” based on different weighted combinations of all the words’ information.
  6. We concatenate the outputs of the two heads.
  7. We multiply this concatenated vector by another learned weight matrix to get the final output of the Multi-Head Attention layer.
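To tie these seven steps together, here is a compact NumPy sketch assuming 2 heads and a toy model dimension of 8. All the weight matrices are random stand-ins for the parameters a real model would learn, and the attention is computed for every word at once rather than just for “the”.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_heads = 8, 2
d_k = d_model // num_heads            # dimension per head

X = rng.normal(size=(6, d_model))     # embeddings for "The cat sat on the mat"

# Step 1: projection matrices for each head (random stand-ins for learned weights).
W_Q = rng.normal(size=(num_heads, d_model, d_k))
W_K = rng.normal(size=(num_heads, d_model, d_k))
W_V = rng.normal(size=(num_heads, d_model, d_k))
W_O = rng.normal(size=(num_heads * d_k, d_model))  # final projection

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

head_outputs = []
for h in range(num_heads):
    Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]   # step 1: project
    scores = Q @ K.T / np.sqrt(d_k)                # steps 2-3: dot products, scaled
    weights = softmax(scores)                      # step 4: probabilities
    head_outputs.append(weights @ V)               # step 5: weighted sum of values

concat = np.concatenate(head_outputs, axis=-1)     # step 6: concatenate heads
output = concat @ W_O                              # step 7: final linear projection
print(output.shape)  # (6, 8): one enriched vector per word
```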

Through training, each head specializes to pay attention to different parts of the input. For example:

  • Head 1 might learn to put more weight on “cat” and “sat” since they are directly related to “the”.
  • Head 2 might focus on “mat”, understanding that “the” is often followed by a noun, and “mat” is a noun that has not been mentioned yet.

Multi-Head Attention allows the model to capture different types of dependencies in the input, such as syntactic and semantic relationships. This is part of what makes the Transformer model powerful for a variety of sequence-to-sequence tasks.

5. Add & Norm

The “Add & Norm” step in the Transformer model is a combination of two sub-steps: a residual connection (the “Add” part) and layer normalization (the “Norm” part). Let’s break down each of these steps:

Residual Connection (Add)

The residual connection is a way of connecting the input of a layer to its output, which helps in training deep networks by allowing gradients to flow through the network directly. In practice, this means that the output of any layer is the sum of its input and its transformed output.

Given an input x and a function F representing the transformations within the layer (such as multi-head attention or a feed-forward neural network), the output with the residual connection is:

output = x + F(x)

This simple addition helps to mitigate the vanishing gradient problem in deep networks, where gradients become so small during backpropagation that learning effectively stops.

Layer Normalization (Norm)

Layer normalization is a technique to normalize the inputs across the features instead of across the batch dimension. It stabilizes the learning process and reduces the training time. Layer normalization computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. If y is the output from the residual connection, then the layer normalization is defined as:

LayerNorm(y) = g ⊙ (y − μ) / √(σ² + ϵ) + b

where:

  • μ is the mean of the elements of y.
  • σ² is the variance of the elements of y.
  • ϵ is a small number to prevent division by zero.
  • g and b are trainable parameters of the layer normalization (gain and bias).
  • ⊙ denotes element-wise multiplication.

Let’s consider an example where we have a Transformer block processing an input vector x through a multi-head attention sublayer, which is the function F in our case. Suppose our input x and the output of the function F(x) are vectors of the form [x1, x2, …, xn] and [f1, f2, …, fn], respectively. The residual connection first produces y = [x1 + f1, x2 + f2, …, xn + fn]; layer normalization then subtracts the mean μ of y from every element, divides by √(σ² + ϵ), and finally applies the gain g and bias b element-wise.

In this way, the “Add & Norm” step combines the benefits of residual connections (allowing gradients to flow through the network without attenuation) with the benefits of layer normalization (ensuring that the inputs to each layer have a mean of 0 and a variance of 1, which can speed up training and lead to better performance).
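Putting the two sub-steps together, here is a minimal NumPy sketch of “Add & Norm”, assuming the gain g is initialized to ones and the bias b to zeros (a common choice); the input values are made up for illustration:

```python
import numpy as np

def add_and_norm(x: np.ndarray, fx: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Residual connection followed by layer normalization over the features."""
    g = np.ones_like(x)    # trainable gain, initialized to 1
    b = np.zeros_like(x)   # trainable bias, initialized to 0

    y = x + fx             # Add: residual connection
    mu = y.mean()          # mean over the features
    var = y.var()          # variance over the features
    return g * (y - mu) / np.sqrt(var + eps) + b   # Norm

x = np.array([0.5, -1.2, 3.3, 0.0])    # input to the sublayer (illustrative)
fx = np.array([1.0, 0.2, -0.5, 2.0])   # sublayer output F(x) (illustrative)
out = add_and_norm(x, fx)
print(out.round(3), out.mean().round(3), out.var().round(3))  # mean ~0, variance ~1
```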
