What are the query, key, and value vectors?
In a transformer architecture, the “query,” “key,” and “value” vectors are the fundamental components of the attention mechanism. Attention is an important part of transformers, allowing the model to focus on different parts of the input sequence when processing information.
These vectors are used to compute the attention scores and to aggregate information across the sequence. They are typically linear projections of the input data (weight matrices applied to each token’s representation), and each plays a specific role in the attention mechanism:
1. Query Vector:
- The query vector represents the element of interest or the context that you want to obtain information about. It is usually derived from the current position in the input sequence or the output of the previous layer.
- The query vector is used to determine the similarity or relevance between this context and other elements in the input sequence, specifically the key vectors.
- Suppose you’re translating a sentence from English to French, and you’re at a particular word in the English sentence (the query). As a simplified analogy, the keys are representations of all the words in the English sentence, and the values are their corresponding translations in French. In actual self-attention, keys and values are both projections of the same input tokens. Example:
“apple” (the word you want to translate)
2. Key Vector:
- The key vector, like the query vector, is a projection of the input data and is associated with each element in the input sequence.
- The key vectors are used to compute how relevant each element in the input sequence is to the query.
- This relevance is often calculated using a dot product or another similarity measure between the query and key vectors.
- Example:
[“cat”, “apple”, “tree”, “juice”] (representations of words in English)
3. Value Vector:
- The value vector is also a projection of the input data and is associated with each element in the input sequence, just like the key vector.
- The value vectors store the actual information that will be used to update the representation of the query. These values are weighted by the attention scores (computed from the query-key interaction) to determine how much each element contributes to the final output.
- Higher attention scores mean that the corresponding values contribute more to the output.
- Example: [“chat”, “pomme”, “arbre”, “jus”] (corresponding French translations)
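Since all three vectors are described above as projections of the input, a minimal sketch of that step may help. The embeddings, the matrix names `W_q`, `W_k`, `W_v`, and every number here are made up for illustration; in a real model the projection matrices are learned during training.

```python
# Toy sketch: project each token embedding into query, key, and value vectors.
# All names and numbers below are hypothetical, chosen only for illustration.

def project(x, W):
    """Multiply a row vector x (length d_model) by a matrix W (d_model x d_k)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# One made-up 3-dimensional embedding per token.
embeddings = {
    "cat":   [1.0, 0.0, 1.0],
    "apple": [0.0, 1.0, 1.0],
}

# Made-up projection matrices (learned parameters in a real transformer).
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
W_v = [[1.0, 1.0], [0.0, 2.0], [1.0, 0.0]]

# Every token gets its own query, key, and value vector from the same embedding.
queries = {tok: project(x, W_q) for tok, x in embeddings.items()}
keys = {tok: project(x, W_k) for tok, x in embeddings.items()}
values = {tok: project(x, W_v) for tok, x in embeddings.items()}

print(queries["apple"], keys["apple"], values["apple"])
```

The point of the sketch is that the same embedding yields three different vectors, one for each role.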
To calculate the attention scores:
- The dot product between the query (“apple”) and each key (“cat,” “apple,” “tree,” “juice”) determines how relevant each word in the English sentence is to “apple.” Higher dot products indicate higher relevance.
- These scores are then used to weight the corresponding values (“chat,” “pomme,” “arbre,” “jus”) when generating the translation for “apple.”
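The scoring step can be sketched in a few lines. The 2-dimensional vectors below are invented for illustration and are not taken from any trained model; only the dot-product operation itself comes from the description above.

```python
# Toy sketch: raw attention scores as dot products between one query and each key.
# The vectors are hypothetical; real query/key vectors come from learned projections.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

query_apple = [1.0, 2.0]  # made-up query vector for "apple"

keys = {
    "cat":   [1.0, 0.5],
    "apple": [1.0, 2.0],
    "tree":  [-1.0, 0.5],
    "juice": [0.5, 1.0],
}

# One score per key: a larger value means the key is more relevant to the query.
scores = {word: dot(query_apple, key) for word, key in keys.items()}

for word, score in scores.items():
    print(f"{word}: {score}")
```

With these made-up vectors, the key for “apple” points in the same direction as the query, so it receives the largest score.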
In this way, the transformer can attend to and aggregate information from the entire input sequence based on the query’s relevance to each key, producing accurate translations and capturing complex relationships within the data.
To make this concrete, suppose the dot products between the query (“apple”) and each key come out as follows (illustrative numbers):
- Dot product(query, “cat”) = high similarity/relevance (e.g., 0.9)
- Dot product(query, “apple”) = very high similarity/relevance (e.g., 0.95)
- Dot product(query, “tree”) = low similarity/relevance (e.g., 0.2)
- Dot product(query, “juice”) = moderate similarity/relevance (e.g., 0.6)
These computed scores indicate that “apple” is highly relevant to “apple” and “cat,” moderately relevant to “juice,” and less relevant to “tree.”
Next, the attention scores are passed through a softmax function, which scales them to produce a probability distribution. This distribution determines the attention weights for each key-value pair. Tokens with higher attention scores get higher weights, meaning they contribute more to the final output.
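A minimal sketch of the softmax step, applied to the illustrative scores from the example. The helper name `softmax` is defined here and is not a library call; subtracting the maximum before exponentiating is a common numerical-stability habit and does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

words = ["cat", "apple", "tree", "juice"]
scores = [0.9, 0.95, 0.2, 0.6]                # illustrative dot products from the example

weights = softmax(scores)

for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

Softmax preserves the ordering of the scores: the key with the largest dot product always receives the largest weight, and the weights always sum to 1.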
For instance, after applying softmax to the scores above (values rounded):
- Attention weight for “cat” ≈ 0.30
- Attention weight for “apple” ≈ 0.32
- Attention weight for “tree” ≈ 0.15
- Attention weight for “juice” ≈ 0.23
Now, you use these attention weights to combine the value vectors (“chat,” “pomme,” “arbre,” “jus”). The weighted sum of the value vectors is the final output for the query token “apple.”
For example, the final output for “apple” would be:
- (0.30 * “chat”) + (0.32 * “pomme”) + (0.15 * “arbre”) + (0.23 * “jus”) = a context-aware representation of “apple,” from which the French translation is produced.
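The weighted sum operates on vectors rather than on the words themselves. In the sketch below, the 2-dimensional value vectors are invented for illustration, and the weights are illustrative too; any non-negative weights that sum to 1 combine the values in the same way.

```python
# Toy sketch: combine value vectors using attention weights.
# Both the weights and the value vectors are hypothetical illustration data.

def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i], computed component by component."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

# Illustrative attention weights for "cat", "apple", "tree", "juice".
weights = [0.30, 0.32, 0.15, 0.23]

# Made-up value vectors standing in for "chat", "pomme", "arbre", "jus".
values = [
    [1.0, 0.0],  # "chat"
    [0.0, 1.0],  # "pomme"
    [1.0, 1.0],  # "arbre"
    [0.0, 2.0],  # "jus"
]

output = weighted_sum(weights, values)
print(output)
```

The result is a single vector of the same size as each value vector, blended according to how much attention each position received.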
This way, self-attention allows the transformer to focus on relevant parts of the input sequence, capture dependencies between tokens, and generate context-aware representations, making it a powerful tool for various natural language processing tasks.
The attention score is calculated using the dot product of the query and key vectors, and this score is then scaled, usually by dividing by the square root of the dimension of the key vectors, √dk (shown as “Divide by 8 (√dk)” in the image, which corresponds to dk = 64). After scaling, a softmax function is applied to the scores to convert them into a probability distribution, which is used as weights for the value vectors.
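Written compactly, these steps are commonly expressed as the scaled dot-product attention formula. The symbols are notation added here by convention rather than taken from the text: Q, K, and V are matrices whose rows are the query, key, and value vectors, and d_k is the dimension of the key vectors.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Here the softmax is applied row by row, so each query gets its own set of weights over all keys.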
The resulting weighted sum of the value vectors determines the output at each position in the sequence. In other words, the output for each query is a weighted combination of all values, where the weights are the attention weights (the softmax-normalized scores). This allows the model to consider the entire input sequence, focus more on the relevant parts, and capture dependencies and relationships within the sequence data effectively.
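Putting the pieces together, here is a minimal pure-Python sketch of scaled dot-product attention for a handful of vectors. The function names `softmax` and `attention` and the toy inputs are assumptions made for illustration; production implementations use batched matrix operations and typically add details such as masking and multiple heads.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    queries: list of query vectors (each of length d_k)
    keys:    list of key vectors (each of length d_k)
    values:  list of value vectors, one per key
    Returns one output vector per query.
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # 1. Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # 2. Softmax turns the scaled scores into attention weights.
        weights = softmax(scores)
        # 3. Weighted sum of the value vectors gives the output for this query.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)])
    return outputs

# Hypothetical toy inputs: one query, two keys, two values.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

out = attention(Q, K, V)
print(out)
```

In this toy input the query matches the first key more closely, so the output leans toward the first value vector while still including some of the second.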