
Attention Is All You Need - Key Concepts
Transformers revolutionized deep learning by replacing recurrence and convolutions with attention. This guide explains the key components, including Q/K/V attention, multi-head attention, and positional encodings, and shows how these ideas power modern models like GPT, BERT, and ViT.

Intro
The key idea of Vaswani et al.’s “Attention Is All You Need” paper was to replace all recurrence/convolutions with multi-head attention. In practice, this means no recurrent loops: every token in a sequence is processed in parallel. This allows much faster training (on GPUs/TPUs) and yields state-of-the-art results in translation and beyond.
Before Transformers, sequence models used encoder–decoder RNNs (LSTMs/GRUs) or CNNs. Such models had two major drawbacks:
- 🚫 Sequential Processing: RNN-based Seq2Seq processed one token at a time, which hinders parallelism and slows training.
- 🔒 Information Bottleneck: Encoding a sentence into a single fixed-size vector can “forget” early words, hurting long-range dependency modeling.
Transformers overcome these by using attention layers to let each output position directly access all inputs, enabling fully parallelized training.
Computational Efficiency and Parallelism
One of the major motivations for removing recurrence was to enable parallel computation.
Model Type | Time Complexity per Layer | Sequential Operations | Maximum Path Length |
---|---|---|---|
RNN | O(n·d²) | O(n) | O(n) |
CNN | O(k·n·d²) | O(1) | O(logₖ(n)) |
Transformer | O(n²·d) | O(1) | O(1) |
- Parallelization: Each token attends to all others simultaneously, making Transformer training highly parallelizable on GPUs/TPUs.
- Contextual Depth: Unlike RNNs, which propagate information step-by-step, Transformers access global context in a single step.
Core Idea: Attention Mechanisms
An attention mechanism lets the model “focus” on the most relevant parts of the input when producing each output. Given Queries (Q), Keys (K) and Values (V), the scaled dot-product attention is:
scores = (Q @ K.transpose(-2,-1)) / sqrt(d_k) # pairwise similarity
weights = softmax(scores, dim=-1) # normalized weights
output = weights @ V # weighted sum of values
Each input token embedding is linearly projected into three vectors: Query, Key, and Value:
🔹 Query (Q): What we are currently looking up.
🔹 Key (K): How each token is described.
🔹 Value (V): The content of each token.
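As a concrete reference, here is a minimal, runnable PyTorch sketch of the scaled dot-product attention described above; the function name and the toy tensor shapes are illustrative choices, not part of the paper.

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (..., seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # pairwise similarity
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                                   # weighted sum of values

# Toy usage: one sequence of 4 tokens with d_k = 8
Q = K = V = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)   # torch.Size([1, 4, 8])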
Multi-Head Attention
The Transformer uses multiple heads in parallel to capture different relationships in the data. Concretely, the model splits the embedding dimension across multiple heads, runs scaled dot-product attention independently in each head, concatenates the heads' outputs, and projects the result back to the original embedding dimension.
Code-Level Breakdown
Multi-head attention allows the model to jointly attend to information from different representation subspaces.
Example:
import math
import torch

def multi_head_attention(Q, K, V, n_heads):
    # Q, K, V: (batch_size, seq_len, d_model)
    batch_size, seq_len, d_model = Q.size()
    d_k = d_model // n_heads
    # Split the embedding dimension across heads: (batch, n_heads, seq_len, d_k)
    Q = Q.view(batch_size, seq_len, n_heads, d_k).transpose(1, 2)
    K = K.view(batch_size, seq_len, n_heads, d_k).transpose(1, 2)
    V = V.view(batch_size, seq_len, n_heads, d_k).transpose(1, 2)
    # Scaled dot-product attention within each head
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(weights, V)
    # Concatenate heads back to (batch, seq_len, d_model); a final linear
    # output projection (W_O) would normally follow
    output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
    return output
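A quick sanity check of the function above, with arbitrary illustrative sizes:

# Batch of 2 sequences, 10 tokens each, d_model = 512, split over 8 heads
Q = K = V = torch.randn(2, 10, 512)
out = multi_head_attention(Q, K, V, n_heads=8)
print(out.shape)   # torch.Size([2, 10, 512])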
Each head captures different linguistic or semantic relationships:
- One head may focus on syntax (e.g., subject–verb agreement)
- Another on coreference or long-range dependencies
Transformer Encoder & Decoder
Encoder Layer
Each encoder layer has a multi-head self-attention sublayer and a position-wise feed-forward sublayer, both with residual connections and layer normalization.
Decoder Layer
The decoder layer has masked self-attention, encoder-decoder attention, and feed-forward sublayers, each with residual connections and normalization.
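To mirror the encoder code shown later, here is a minimal decoder-layer sketch; the Post-LN arrangement and module names follow this guide's encoder example and are a simplified illustration, not a definitive implementation.

import torch
import torch.nn as nn

class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)   # masked self-attention
        self.cross_attn = nn.MultiheadAttention(d_model, nhead)  # encoder-decoder attention
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, memory, tgt_mask=None):
        # Masked self-attention over the target sequence
        attn, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norm1(x + attn)
        # Attend to the encoder output ("memory")
        attn, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + attn)
        # Position-wise feed-forward
        x = self.norm3(x + self.ff(x))
        return x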
Positional Encoding
Since self-attention has no built-in notion of sequence order, the Transformer adds sinusoidal positional encodings:
import math

def positional_encoding(t, i, d):
    # t: position in the sequence, i: embedding dimension index, d: model dimension
    # Even dimensions use sine, odd dimensions use cosine, sharing their pair's frequency
    angle = t / (10000 ** ((2 * (i // 2)) / d))
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)
This encoding is added to the input embeddings to preserve order information.
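As a usage sketch (the sequence length and model dimension here are arbitrary), the scalar function above can be expanded into a full (seq_len, d_model) matrix and added to the token embeddings:

import torch

seq_len, d_model = 10, 512
pe = torch.tensor([[positional_encoding(t, i, d_model) for i in range(d_model)]
                   for t in range(seq_len)])     # (seq_len, d_model)
embeddings = torch.randn(seq_len, d_model)       # token embeddings (random stand-in)
x = embeddings + pe                              # order-aware inputs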
Modern Alternatives
The original Transformer uses sinusoidal positional encodings, but newer architectures explore alternatives:
- Learned Positional Embeddings: Trainable parameters per position.
- Rotary Position Embeddings (RoPE): Used in GPT-NeoX and LLaMA.
- ALiBi (Attention with Linear Biases): Enables extrapolation to longer sequences.
These enhancements improve long-context modeling and extrapolation beyond training sequence lengths.
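For contrast with the fixed sinusoidal scheme, a learned positional embedding is just a trainable lookup table. The sketch below shows the idea; the class name, max_len, and d_model are assumptions for illustration.

import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_len=512, d_model=512):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)   # one trainable vector per position

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_emb(positions)              # broadcasts over the batch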
Putting It All Together
Example PyTorch implementation of a single Transformer encoder layer:
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, src_mask=None):
        # x: (seq_len, batch, d_model), the default layout for nn.MultiheadAttention
        attn_output, _ = self.self_attn(x, x, x, attn_mask=src_mask)
        x = self.norm1(x + attn_output)   # Residual + Norm
        ff_output = self.ff(x)
        x = self.norm2(x + ff_output)     # Residual + Norm
        return x
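A quick smoke test of the layer; note that nn.MultiheadAttention defaults to the (seq_len, batch, d_model) layout unless batch_first=True is passed:

layer = TransformerEncoderLayer(d_model=512, nhead=8)
x = torch.randn(10, 2, 512)   # (seq_len, batch, d_model)
out = layer(x)
print(out.shape)              # torch.Size([10, 2, 512])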
Transformer Variants and Scaling Insights
Since 2017, Transformers have evolved into a family of powerful architectures:
- BERT (2018): Bidirectional encoder-only Transformer for understanding tasks.
- GPT (2018–2025): Decoder-only Transformer for autoregressive generation.
- T5 (2019): Text-to-text model using encoder–decoder.
- ViT (2020): Vision Transformer applied to image patches.
Scaling laws (Kaplan et al., 2020) show performance improves predictably with:
- Model size (parameters)
- Dataset size
- Compute budget
Practical Insights for Implementation
- Masking:
  - Use causal masks in decoders to prevent future token access.
  - Use padding masks to ignore padded tokens in batches (see the mask sketch after this list).
- Initialization:
  - Layer normalization before sublayers (Pre-LN) improves training stability in deep networks.
- Training Efficiency:
  - Use mixed precision (FP16/BF16) to speed up training.
  - Gradient checkpointing reduces memory usage for large models.
- Optimization:
  - AdamW optimizer with learning rate warmup and cosine decay.
  - Typical warmup: 10k–20k steps.
- Batching:
  - Dynamic batching by sequence length improves GPU utilization.
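A minimal sketch of the two mask types mentioned under Masking, using the boolean convention accepted by nn.MultiheadAttention (True marks positions to ignore); the sizes and token ids are illustrative:

import torch

seq_len = 5
# Causal mask: True above the diagonal = "do not attend to future tokens"
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Padding mask: True where the batch contains padding tokens (here, token id 0)
token_ids = torch.tensor([[5, 7, 9, 0, 0],
                          [3, 2, 8, 6, 1]])
padding_mask = token_ids == 0   # (batch, seq_len), True = ignore

# These can be passed to nn.MultiheadAttention as attn_mask / key_padding_mask.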
Real-World Applications
Transformers power almost all modern generative models:
- ChatGPT / Claude / Gemini: Decoder-only Transformers.
- BERT / RoBERTa: Encoder-only models for classification, NER, QA.
- Whisper: Speech-to-text Transformer.
- ViT / CLIP: Image and multimodal Transformers.
Understanding self-attention helps in adapting these models to new domains (e.g., code, audio, video).
Visualizing Attention
Attention matrices can be visualized to interpret model behavior:
import matplotlib.pyplot as plt
import seaborn as sns

# weights: (batch, n_heads, seq_len, seq_len); plot the first head of the first example
sns.heatmap(weights[0, 0].detach().cpu(), cmap="viridis")
plt.title("Attention Weights for Head 1")
plt.show()
These plots often reveal:
- Which words attend to others.
- How long-range dependencies are modeled.
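One way to obtain a weights tensor like the one plotted above is to ask nn.MultiheadAttention for per-head weights directly; this assumes a PyTorch version that supports the average_attn_weights argument.

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(1, 10, 512)   # (batch, seq_len, d_model)
_, weights = mha(x, x, x, need_weights=True, average_attn_weights=False)
# weights: (batch, n_heads, seq_len, seq_len)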
Conclusion
The Transformer’s use of attention and parallel processing revolutionized NLP. Vaswani et al.’s paper is foundational to modern AI, providing the architecture behind models like GPT and BERT. Understanding each component, from Q/K/V vectors and multi-head attention to positional encodings and encoder/decoder layers, makes it far easier to build and adapt state-of-the-art models.