Attention Mechanism in LLMs

This document provides a structured overview of the self-attention mechanism, which forms the core of the attention layer in the Transformer architecture used by LLMs.


The Self-Attention Mechanism: The Engine of Transformer Architecture

The Self-Attention Mechanism (SAM) is the fundamental component of the Transformer architecture, used in models like GPT. Its primary function is to transform a sequence of input embedding vectors into a sequence of enriched Context Vectors.

This process allows the model to weigh the importance of every other word in the input sequence when processing a specific word, thus capturing long-range dependencies and context. The mechanism is also formally known as Scaled Dot-Product Attention.
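Written out as a formula, scaled dot-product attention computes, for query, key, and value matrices Q, K, and V (with dk the dimension of the key vectors):

Attention(Q, K, V) = softmax( Q·Kᵀ / √dk ) · V

The sections below unpack each ingredient of this formula step by step.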

Core Objective: The Context Vector

For every input token (word), the attention mechanism computes a Context Vector. This vector is an enriched representation that includes:

  1. The original semantic meaning of the token.

  2. Weighted information about its relationship and relevance to all other tokens in the sequence, allowing it to capture context (e.g., in "The bank of the river," the word "bank" is contextualized by "river").


Why does the Transformer go beyond simple similarity at all? The shift from direct similarity between embeddings to three learned matrices (Q, K, V) is what makes the architecture so powerful.

Here is a breakdown of how attention weights are calculated in theory, and how the introduction of WQ, WK, and WV makes the mechanism truly transformative.


1. Attention Without Trainable Matrices (Conceptual Baseline)

Before introducing trainable weights, the simplest form of attention—often used as a starting point for intuition—relies on direct similarity between the encoded input vectors.

In this conceptual model:

  1. Input: You start with a sequence of input embedding vectors (e.g., from a word embedding layer, X).

  2. Query and Key: Both the Query and the Key for a given token are considered to be the same as its input embedding vector (i.e., qi = ki = xi).

  3. Similarity Score Calculation (Dot Product): To find out how much a query token (say, "bank") should pay attention to another key token (say, "river"), you calculate a similarity score between their respective input vectors. The most common and simple way to do this is the dot product: score(i, j) = xi · xj.

    • This operation measures the alignment and magnitude of the two vectors. If they point in a similar direction (conceptually related), the score will be high.

  4. Normalization: These raw scores are then normalized (e.g., using SoftMax) to produce the initial attention weights.

  5. Context Vector: Finally, the context vector for the query token is the weighted sum of all input vectors, using the calculated weights.

In this baseline model, attention is calculated based purely on the fixed, initial content of the input embeddings. The model learns nothing about how to strategically focus its attention; it just uses the inherent semantic similarity baked into the word embeddings.
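To make this baseline concrete, here is a minimal sketch in PyTorch (toy dimensions, random embeddings) in which the input embeddings themselves play the role of queries, keys, and values:

Python
import torch

# Toy sequence: 4 tokens, each with an 8-dimensional embedding
X = torch.randn(4, 8)

# Steps 1-2: queries and keys are just the input embeddings themselves
# Step 3: similarity scores via dot products between every pair of tokens
scores = X @ X.T                        # shape: (4, 4)

# Step 4: normalize each row into attention weights that sum to 1
weights = torch.softmax(scores, dim=-1)

# Step 5: context vectors as the weighted sum of the input vectors
context = weights @ X                   # shape: (4, 8)

Nothing here is learned: the weights depend only on the fixed content of X.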


2. Attention With Trainable Matrices


Why Introduce Three Distinct Trainable Weight Matrices?

The limitation of the conceptual baseline is that it's passive. The introduction of the three distinct, trainable weight matrices (WQ, WK, and WV) transforms the process from static similarity into a dynamic, learning mechanism.

Q = X·WQ,  K = X·WK,  V = X·WV
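In code, these projections are just three learned linear maps applied to the same input (a sketch with illustrative dimensions; real models use much larger ones):

Python
import torch
import torch.nn as nn

d_in, d_out = 8, 8                 # illustrative embedding / projection sizes
X = torch.randn(4, d_in)           # 4 tokens

W_query = nn.Linear(d_in, d_out, bias=False)   # plays the role of WQ
W_key   = nn.Linear(d_in, d_out, bias=False)   # plays the role of WK
W_value = nn.Linear(d_in, d_out, bias=False)   # plays the role of WV

Q = W_query(X)   # Q = X·WQ
K = W_key(X)     # K = X·WK
V = W_value(X)   # V = X·WV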

1. Separate Roles (The "Search Engine" Analogy)

The core reason for three matrices is to give the model the ability to learn three different perspectives for each token:

  • WQ (Query perspective): Teaches the model how to best ask a question about the current token's context.

  • WK (Key perspective): Teaches the model how to best label its content so it can be successfully matched by others.

  • WV (Value perspective): Teaches the model how to best package its information to be combined into the final context vector.

2. Learned Task-Specific Attention

By making WQ and WK separate, the model can learn to focus attention differently based on the required task. The vectors are no longer constrained by their initial content:

  • The Query for the word "animal" might learn a projection that emphasizes its syntactic role (e.g., a noun) when calculating attention.

  • The Key for the word "eats" might learn a projection that emphasizes its action role (e.g., a verb) when being attended to.

This separation allows the dot product, Q·Kᵀ, to capture complex, task-specific relationships (e.g., subject-verb agreement, temporal links) rather than just general semantic similarity.

3. Creating the Final Context Vector

The Value matrix (WV) plays a distinct, crucial role: it ensures that the content being aggregated into the final context vector is optimally represented.

The Value vector's content is decoupled from the content used for scoring (Q and K). This means the model can use a highly specific, fine-tuned representation of the token's content for the final aggregation, even if the Q and K vectors needed to be transformed into highly abstract representations purely for the scoring process.

In summary, the WQ, WK, and WV matrices introduce trainable parameters that allow the Transformer to learn an optimal, dynamic, and context-aware function for calculating attention, far exceeding the capabilities of simple, fixed similarity scores.


Key Components: Query, Key, and Value (QKV)

The mechanism introduces three trainable weight matrices, which are optimized during model training to create three new representations for every input token:

Component | Analogy/Intuition | Purpose
Query (Q) | The search query/focus | Represents the current token being processed (What am I looking for?)
Key (K) | The index/label | Represents all other tokens in the sequence (How do I match the query?)
Value (V) | The actual content/data | Represents the content to be retrieved once a match is made

The Four Steps of Self-Attention

The self-attention mechanism can be broken down into four essential steps:

Step 1: Input Embeddings to Q, K, and V Matrices

The input matrix, which contains the embedding vectors for all tokens in the sequence, is multiplied by the three trainable weight matrices (WQ, WK, WV).

These trainable weight matrices project the input embeddings into three distinct representation spaces (Q, K, V).

Step 2: Calculate Attention Scores

To determine the relevance between the current query and all keys, a dot product is performed between the Queries matrix (Q) and the transpose of the Keys matrix (Kᵀ).

This matrix multiplication results in an n × n matrix (where n is the sequence length), in which each row represents the alignment score of one token (query) with every other token (key). A higher score indicates a stronger alignment.

Step 3: Compute Attention Weights (Scaling and SoftMax)

The attention scores are converted into Attention Weights through a two-part process:

  1. Scaling: The scores are divided by the square root of the dimension of the key vectors (√dk).

  2. Normalization (SoftMax): The SoftMax function is applied row-wise to ensure that the weights for any given query sum up to 1.



The resulting matrix is the definitive set of weights, quantifying the "attention percentage" each word should pay to all other words.

The Importance of Scaling 

The scaling factor is crucial for learning stability. Without it, the dot product values can become extremely large as the vector dimension (dk) increases. When these large values are passed to the SoftMax function, the output becomes peaky (one weight approaches 1, the others approach 0), making learning unstable. Dividing by √dk helps in two ways (a quick numerical check follows the list below):

  1. Reduce Magnitude: Prevents overly large input to SoftMax.

  2. Normalize Variance: Keeps the variance of the dot product close to 1, regardless of the vector dimension.
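A quick numerical check of the variance claim (a sketch; exact values vary from run to run): for random vectors whose components have unit variance, the raw dot product has variance close to dk, while the scaled version stays close to 1.

Python
import torch

d_k = 512
q = torch.randn(10_000, d_k)          # components ~ N(0, 1)
k = torch.randn(10_000, d_k)

raw = (q * k).sum(dim=-1)             # raw dot products
scaled = raw / d_k ** 0.5             # scaled dot products

print(raw.var().item())               # roughly d_k (about 512)
print(scaled.var().item())            # roughly 1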

Step 4: Generate the Context Vector

The final step involves multiplying the Attention Weights with the Values (V) matrix.

This operation performs a weighted sum of the Value vectors, using the Attention Weights as coefficients. The result is the final Context Vector Matrix, where each row is the enriched representation for its corresponding token.
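Putting the four steps together, a minimal single-head, unmasked self-attention module might look like the sketch below (illustrative dimensions; real implementations add batching, causal masking, dropout, and multiple heads):

Python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.d_k = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        # Step 1: project the input embeddings into Q, K, V
        Q, K, V = self.W_query(x), self.W_key(x), self.W_value(x)
        # Step 2: attention scores via Q · Kᵀ
        scores = Q @ K.transpose(-2, -1)
        # Step 3: scale by sqrt(d_k) and normalize row-wise with softmax
        weights = torch.softmax(scores / math.sqrt(self.d_k), dim=-1)
        # Step 4: context vectors as the weighted sum of the value vectors
        return weights @ V

# Example: 4 tokens with 8-dimensional embeddings
x = torch.randn(4, 8)
context = SelfAttention(8, 8)(x)      # shape: (4, 8)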


Unmasking the Magic: Understanding and Implementing Causal Self-Attention

Introduction: Why LLMs Can't Peek Ahead

Large Language Models (LLMs) like GPT are built upon the revolutionary Transformer architecture, and at the heart of the Transformer is the Attention Mechanism. This mechanism allows a model to weigh the importance of different tokens in a sequence when processing any single token.

However, for a model designed for text generation—where it predicts the next word based only on the words that came before it—standard attention is a problem. If the model is predicting the fifth word in a sentence, it shouldn't be allowed to "peek" at the sixth, seventh, or eighth word.

This necessity for sequential, time-aware processing gives rise to the Causal Self-Attention mechanism, also known as Masked Attention.

The Causal Constraint: An Irreversible Flow

Causal Attention’s core requirement is simple: when computing the representation for a token, the model must only attend to the current token and all tokens that preceded it. Future tokens must be completely ignored.

In the attention matrix, this constraint translates to a visual pattern: the area above the main diagonal (representing future tokens) must be zeroed out.
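In code, this mask is simply an upper-triangular matrix. A small sketch in PyTorch:

Python
import torch

T = 4  # sequence length
mask = torch.triu(torch.ones(T, T), diagonal=1)
print(mask)
# tensor([[0., 1., 1., 1.],
#         [0., 0., 1., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 0.]])
# The 1s above the diagonal mark the future positions each token must ignore.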

The Problem with a Naive Mask

A seemingly straightforward approach is to calculate the attention weights, set the future tokens to zero, and then re-normalize. However, this leads to a critical flaw called data leakage:

  1. Standard attention scores are calculated.

  2. The softmax function is applied to these scores.

  3. The softmax denominator involves the sum of all scores (including future ones). Even if you set the future weights to zero after softmax, the past weights have already been unfairly scaled down by the influence of those future scores.

The Solution: The Negative Infinity Mask Trick

The elegant and correct solution is to apply the mask to the raw attention scores (before softmax).

  1. Calculate Raw Scores: Compute Q·Kᵀ / √dk to get the initial (scaled) attention scores.

  2. Apply the Infinity Mask: Instead of replacing future token scores with zero, we replace them with negative infinity (-∞), or a numerically stable equivalent such as a very large negative number like -10^15.

  3. The Softmax Effect: When softmax is applied:

    • The exponential of -∞, e^(-∞), is 0.

    • The future tokens are effectively ignored in the normalization sum.

    • The remaining unmasked weights are automatically re-normalized correctly to sum to one.

This single step simultaneously solves the masking requirement and the data leakage problem, providing true causal attention.
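A toy example of the trick on a single row of scores (made-up numbers, with the last two positions treated as future tokens):

Python
import torch

scores = torch.tensor([2.0, 1.0, 3.0, 4.0])        # raw scores for one query token
future = torch.tensor([False, False, True, True])  # positions 3 and 4 are in the future

masked_scores = scores.masked_fill(future, float('-inf'))
weights = torch.softmax(masked_scores, dim=-1)

print(weights)        # tensor([0.7311, 0.2689, 0.0000, 0.0000])
print(weights.sum())  # tensor(1.) -- the unmasked weights re-normalize automatically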

Implementing Causal Attention in PyTorch

In a complete PyTorch implementation, the CausalAttention module includes all the necessary components: the Query, Key, and Value weights, the Causal Mask logic, and a Dropout layer for robustness.

Here is a minimal, self-contained version of the module; the initialization details are filled in as a sketch, while the forward pass shows the core causal-mask logic:

Python
import math
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout):
        super().__init__()
        self.d_k = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.dropout = nn.Dropout(dropout)
        # Upper triangular buffer: 1s above the diagonal mark future positions
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        # Compute Q, K, V from input x
        Q = self.W_query(x)
        K = self.W_key(x)
        V = self.W_value(x)

        # 1. Compute Scaled Attention Scores
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)

        # 2. Get the Causal Mask (T is current sequence length)
        T = x.shape[1]
        causal_mask = self.mask[:T, :T]

        # 3. Apply the Infinity Mask
        # self.mask is an upper triangular tensor (1s above the diagonal)
        scores = scores.masked_fill(causal_mask == 1, float('-inf'))

        # 4. Apply Softmax and Dropout
        attention_weights = torch.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)

        # 5. Compute Context Vector
        context_vector = attention_weights @ V

        return context_vector
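A quick usage check of the sketch above (batch size, sequence length, and dimensions are illustrative):

Python
import torch

torch.manual_seed(0)
x = torch.randn(2, 6, 16)   # (batch, sequence length, embedding dimension)

attn = CausalAttention(d_in=16, d_out=16, context_length=128, dropout=0.1)
context = attn(x)

print(context.shape)        # torch.Size([2, 6, 16])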

Key Takeaways

The Causal Self-Attention mechanism is not just a modification; it is the fundamental enabler of autoregressive (generative) sequence models. By utilizing the simple yet powerful trick of the Negative Infinity Mask, we ensure that models like GPT abide by the irreversible nature of time, looking only at the past to confidently predict the future.

