Transformer: The Soul of Large Language Models
ChatGPT, Claude, Gemini—every world-changing AI is built upon a single architecture proposed in 2017: the Transformer. At the heart of the paper "Attention Is All You Need" lies an elegant "soft retrieval" system: each word sends queries to all other words, receives answers, and after nearly a hundred layers of such interaction, the model "understands" language. This article walks you through this world-changing algorithm, step by step.
📺 Study Sources
- 📺 Source Video 1 — 3Blue1Brown
- 📺 Source Video 2 — 3Blue1Brown
- 📺 Source Video 3 — Grant Sanderson
This guide aims to systematically explain the core architecture, operating mechanisms, and key mathematical concepts of large language models (LLMs), based on 3Blue1Brown's in-depth explanatory videos on Transformers (Chapters 5 and 6).
I. Core Concepts of the Transformer Architecture
1.1 The Fundamental Task: Predicting the Next Token
The Transformer's core objective is to read a passage of text and predict the next token that will appear in the sequence. This process is not only the foundation of text generation but also the pathway through which the model understands the deeper meaning of language.
- Tokenization: Text is broken down into small pieces, typically whole words or sub-word units.
- Autoregressive Generation: The model generates long-form content word by word by repeatedly predicting the next token and feeding it back into the input sequence.
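The autoregressive loop can be sketched in a few lines. This is a toy illustration, not a real model: `toy_model` is a hypothetical stand-in that returns a random next-token distribution, and token IDs are plain integers.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 8  # toy vocabulary for illustration

def toy_model(tokens):
    """Stand-in for a Transformer: ignores its input and returns a
    random probability distribution over the next token."""
    logits = rng.normal(size=VOCAB_SIZE)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        probs = toy_model(tokens)           # predict a next-token distribution
        next_token = int(np.argmax(probs))  # greedy: pick the most likely token
        tokens.append(next_token)           # feed it back into the input sequence
    return tokens

out = generate([1, 2, 3], n_new=4)  # 3 prompt tokens + 4 generated tokens
```

The key structural point is the feedback: each predicted token is appended to the sequence and becomes part of the input for the next prediction.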
1.2 Embedding Space
- Embedding Vectors: Each token is initially associated with a vector in a high-dimensional space (e.g., 12,288 dimensions for GPT-3).
- Semantic Directions: In this high-dimensional space, different directions represent different semantic meanings (e.g., gender, singular/plural, nationality, etc.).
- Context-Free Lookup: The initial embedding is essentially a lookup table; the same token has the same initial vector in different contexts (e.g., different meanings of "mole"). The Transformer's subsequent steps are responsible for adjusting these vectors based on context.
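The context-free lookup can be demonstrated directly. The tiny vocabulary and 4-dimensional table below are invented for illustration (GPT-3's real table is 50,257 tokens by 12,288 dimensions):

```python
import numpy as np

# Toy embedding table: 5 tokens, 4 dimensions each.
vocab = {"the": 0, "mole": 1, "dug": 2, "a": 3, "tunnel": 4}
E = np.random.default_rng(0).normal(size=(5, 4))

def embed(tokens):
    # Context-free lookup: the same token always maps to the same row of E.
    return E[[vocab[t] for t in tokens]]

a = embed(["the", "mole", "dug"])
b = embed(["a", "mole", "tunnel"])
# "mole" receives the identical initial vector in both contexts;
# disambiguation is left to the attention blocks that follow.
same = np.array_equal(a[1], b[1])
```

This is exactly why the attention mechanism is needed: the lookup alone cannot distinguish the animal "mole" from the skin blemish or the unit of chemistry.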
1.3 Data Flow Process
- Input: Text is split into a sequence of tokens.
- Embedding: Tokens are converted into initial vectors, incorporating positional information.
- Attention Block: Vectors "communicate" with each other, updating their meaning based on surrounding tokens.
- Multi-Layer Perceptron (MLP): Each vector is processed independently in parallel, further refining information through a series of "questions."
- Iterative Looping: Data alternates through multiple attention blocks and MLPs (GPT-3 has 96 layers in total).
- Output Processing: The final vector is mapped back to vocabulary size through an "Unembedding Matrix," then converted into a probability distribution via the Softmax function.
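The data flow above can be written as a shape-level skeleton. The `attn` and `mlp` callables here are zero-valued placeholders, assumed only to preserve shape; the point is the residual structure and the final unembedding-plus-softmax step.

```python
import numpy as np

def forward(x, layers, W_unembed):
    """Shape-level sketch of the Transformer data flow.
    x: (seq_len, d_model) embeddings, position info already added."""
    for attn, mlp in layers:
        x = x + attn(x)   # attention block: vectors exchange information
        x = x + mlp(x)    # MLP: each vector refined independently
    logits = x[-1] @ W_unembed           # map the final vector to vocab scores
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()               # probability distribution over the vocab

d_model, vocab_size, seq_len = 6, 10, 4
rng = np.random.default_rng(1)
zero = lambda x: np.zeros_like(x)        # placeholder blocks (identity update)
layers = [(zero, zero)] * 2
W_unembed = rng.normal(size=(d_model, vocab_size))
probs = forward(rng.normal(size=(seq_len, d_model)), layers, W_unembed)
```

Note that only the last position's vector is unembedded here, matching the next-token-prediction objective described above.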
II. Deep Dive into the Attention Mechanism
The attention mechanism allows the model, when processing a specific token, to "attend to" other relevant tokens in the sequence, thereby absorbing contextual information.
2.1 Query, Key, Value (Q, K, V)
To compute updates, each vector generates three smaller-dimensional vectors:
- Query (Q): Represents what information this token is looking for (e.g., a noun looking for adjectives that modify it).
- Key (K): Represents what information this token can provide (e.g., an adjective declaring it can modify a following noun).
- Value (V): If a match is successful, the actual information content to be transmitted to other tokens.
2.2 Calculation Steps
- Similarity Matching: Compute the dot product between the query vectors and key vectors of all tokens to obtain relevance scores.
- Masking: In generative models, to prevent "peeking" at future tokens, scores for subsequent positions are set to negative infinity.
- Normalization (Softmax): Apply Softmax to the scores so that each column's weights sum to 1, forming the Attention Pattern.
- Weighted Sum: Based on the attention pattern, compute a weighted sum of the value vectors of all tokens, yielding the delta (\Delta E).
- Update: Add the delta back to the original embedding vector.
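The five steps above can be implemented in a few lines of numpy. This is a single-head sketch with simplifications: the value matrix is kept square (d_model to d_model) rather than factored through a smaller space, and softmax is applied along rows (each token's query distributed over all keys).

```python
import numpy as np

def attention_head(X, W_Q, W_K, W_V):
    """One masked attention head. X: (seq_len, d_model) embeddings."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # 1. similarity matching (dot products)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                       # 2. masking: no peeking at future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # 3. softmax -> attention pattern
    delta_E = weights @ V                        # 4. weighted sum of value vectors
    return X + delta_E                           # 5. update: add the delta back

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                      # 4 tokens, 8-dim embeddings
W_Q, W_K = rng.normal(size=(8, 2)), rng.normal(size=(8, 2))
W_V = rng.normal(size=(8, 8))
out = attention_head(X, W_Q, W_K, W_V)
```

Because of the mask, the first token can only attend to itself, so its output is exactly `X[0] + X[0] @ W_V`.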
2.3 Multi-Head Attention
The Transformer runs multiple attention heads in parallel (GPT-3 has 96 heads per layer).
- Each head has independent W_Q, W_K, W_V weight matrices.
- Different heads can capture different semantic relationships (such as grammatical structure, sentiment preferences, specific factual associations, etc.).
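Conceptually, the heads run independently and their output deltas are summed into the residual stream. The sketch below makes that explicit with a Python loop; real implementations batch all heads into single tensor operations. The per-head output projection `W_O` is an assumption of this sketch (it plays the role of the "value-up" map that returns each head's 2-dimensional result to the full embedding space).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, seq_len = 16, 4, 4, 5
X = rng.normal(size=(seq_len, d_model))

deltas = np.zeros_like(X)
for _ in range(n_heads):
    # Each head owns its own W_Q, W_K, W_V (plus a projection back up).
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    W_O = rng.normal(size=(d_head, d_model))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    deltas += (w @ V) @ W_O        # each head contributes its own delta
X_out = X + deltas                 # all heads' contributions enter the residual stream
```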
III. Technical Parameters and Mathematical Details
3.1 Parameter Counting (Using GPT-3 as an Example)
| Component | Description | Estimated Parameters |
|---|---|---|
| Embedding Matrix | 50,257 tokens \times 12,288 dimensions | ~617 million |
| Attention Heads | Each head contains W_Q, W_K, W_V (each in 128-dimensional space) | ~6.3 million per head |
| Attention Blocks Total | 96 layers \times 96 heads/layer | ~58 billion |
| Total Model Parameters | Including MLP layers, Unembedding, etc. | 175 billion |
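The table's estimates can be recomputed directly. The per-head count of four 12,288 × 128 matrices follows the video's presentation, in which the value map is factored into "value-down" and "value-up" matrices alongside W_Q and W_K:

```python
n_vocab, d_model, d_head = 50_257, 12_288, 128
n_layers, n_heads = 96, 96

embedding = n_vocab * d_model                    # 617,558,016  (~617 million)
per_head = 4 * d_model * d_head                  # 6,291,456    (~6.3 million)
attention_total = n_layers * n_heads * per_head  # 57,982,058,496  (~58 billion)

print(f"Embedding:  {embedding:,}")
print(f"Per head:   {per_head:,}")
print(f"Attention:  {attention_total:,}")
```

The remaining ~117 billion parameters of the 175-billion total sit mostly in the MLP layers, which the table folds into the final row.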
3.2 Key Functions
- Softmax Function: Converts logits (raw scores) into a probability distribution: each logit x_i is mapped to e^{x_i} / \sum_j e^{x_j}, so all outputs are positive and sum to 1.
- Temperature: Adjusts the smoothness of the Softmax during prediction.
* T=0: Greedy search, always picks the highest-probability token (stable but monotonous results).
* T higher: Increases the chance of selecting lower-probability tokens (more creative results but prone to hallucinations or nonsense).
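Temperature enters the softmax as a divisor on the logits before exponentiation. A minimal sketch (the T=0 branch is handled as a special case, since dividing by zero is undefined, and it reduces to greedy argmax):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    if T == 0:                        # greedy: all probability on the max logit
        p = np.zeros(len(logits))
        p[np.argmax(logits)] = 1.0
        return p
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p_low  = softmax_with_temperature(logits, 0.2)  # nearly one-hot
p_mid  = softmax_with_temperature(logits, 1.0)  # standard softmax
p_high = softmax_with_temperature(logits, 5.0)  # close to uniform
```

Lower T sharpens the distribution toward the top token; higher T flattens it, which is exactly the stable-vs-creative trade-off described above.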
IV. Short-Answer Practice Questions
- In which paper was the Transformer originally proposed?
*Model Answer:* The 2017 paper "Attention is All You Need."
- Why is the embedding vector for the same word always identical in the Transformer's first step?
*Model Answer:* Because the initial embedding is retrieved from a context-free lookup table and has not yet undergone contextualization by the attention blocks.
- What is the geometric significance of the "dot product" in the attention mechanism?
*Model Answer:* The dot product measures the degree of alignment (similarity) between two vectors in space. Positive values indicate similar directions, zero indicates perpendicular, and negative values indicate opposite directions.
- What is "Masking"? What is its purpose?
*Model Answer:* Masking is the operation of setting relevance scores of subsequent tokens to negative infinity before computing Softmax. Its purpose is to ensure that earlier tokens are not influenced by later tokens during training and inference.
- Why has the Transformer architecture been so successful in the past decade?
*Model Answer:* The key is its exceptional suitability for parallel computation, enabling full utilization of GPUs to complete massive calculations in a short time, thereby supporting enormous increases in model scale.
V. In-Depth Essay Questions
- Discuss the concept of "semantic directions" in embedding space, and illustrate with examples how the model handles word sense disambiguation (e.g., different meanings of "mole").
- Explain in detail the collaborative logic among Query (Q), Key (K), and Value (V). In the context of multi-head attention, how does this mechanism enhance the model's ability to understand complex literary works (such as Harry Potter)?
- Analyze the limitations imposed by context size on model performance. Why is expanding the context window computationally challenging? (Hint: consider the growth rate of the attention pattern grid.)
VI. Glossary
| Term | Definition |
|---|---|
| Token | The smallest unit of text processed by the model; can be a word, subword, or character. |
| Logits | The raw, unnormalized scores output by the network before entering Softmax. |
| Attention Pattern | The grid of relevance weights indicating how strongly each token attends to every other token in the sequence. |
| Residual Connection | The process of adding the delta (\Delta E) output by the attention block back to the original input vector. |
| MLP (Multi-Layer Perceptron) | A module located after the attention layer that independently computes and refines each vector in parallel. |
| Softmax | A mathematical function that converts a vector into a probability distribution, with elements between 0 and 1 that sum to 1. |
| Unembedding Matrix | The weight matrix that maps high-dimensional semantic vectors back to a probability distribution over the vocabulary. |
| Temperature | A tuning parameter that controls the concentration of the model's output probability distribution. |