TCombinator

Attention in 3 lines

Pemberton, BC

"They're made out of weights."

"Weights?"

"Weights. Floating-point numbers. We checked the whole thing through. It's nothing but weights."

"Weights doing what? Where do the words come from?"

Short excerpt from https://maxleiter.com/blog/weights

LLMs can feel mysterious because the finished systems are enormous. The basic transformer idea is smaller than that: turn text into numbers, let every token look at the tokens before it, and use the result to predict what comes next. Many of an LLM's capabilities emerge from the scale of its training; see Why Scale Matters below.

The important word in that sentence is look. Attention is the part of the transformer that decides which previous tokens are useful for understanding the current token, i.e. which tokens to pay more attention to.

Text Becomes Tokens

Computers do not see words the way we do. A model receives numbers, so the first step is tokenization.

Tokenization splits text into smaller pieces called tokens. A token might be a word, part of a word, punctuation, whitespace, or some repeated text fragment. Each token gets an integer ID.

For example:

"the bird is red" -> ["the", "bird", "is", "red"]

Those token strings are then mapped to IDs:

["the", "bird", "is", "red"] -> [1820, 10214, 374, 2579]

The exact IDs do not matter here. They are just labels. If bird is token 10214, that number does not contain bird-ness. It is no more meaningful than a row number in a spreadsheet.
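The mapping above can be sketched in a few lines. This is only a toy, assuming a whitespace tokenizer and a hypothetical four-entry vocabulary that reuses the IDs from the example; real tokenizers split text into subword pieces rather than whole words.

```python
# Hypothetical toy vocabulary: the IDs are copied from the example above.
VOCAB = {"the": 1820, "bird": 10214, "is": 374, "red": 2579}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def tokenize(text: str) -> list[str]:
    """Split on whitespace. Real tokenizers split into subword pieces."""
    return text.split()


def encode(text: str) -> list[int]:
    """Map each token string to its integer ID."""
    return [VOCAB[token] for token in tokenize(text)]


def decode(ids: list[int]) -> str:
    """Map IDs back to token strings and join them with spaces."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("the bird is red")
print(ids)          # [1820, 10214, 374, 2579]
print(decode(ids))  # the bird is red
```

The IDs carry no meaning on their own; swapping them for any other unique integers would work just as well.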

Tokenization → Embedding

Words become vectors

[Figure: text to embedding. Each word of a sentence is mapped to an embedding vector.]

Tokens Become Vectors

Token IDs become useful only after we map them into vectors. These vectors are called embeddings.

An embedding is a list of numbers learned during training. Instead of treating a token as an isolated dictionary entry, the model stores it as a point in a learned space. Tokens used in similar contexts tend to end up near each other.
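As a rough sketch, an embedding layer is just a table with one row per token, and "embedding a token" is a row lookup. The sizes below and the random values are assumptions for illustration; in a trained model the rows are learned, and the table is indexed by the tokenizer's IDs rather than by 0 to 3.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 4   # hypothetical tiny vocabulary: the, bird, is, red
d = 8            # embedding dimension (assumed for illustration)

# One row per token. Training would adjust these values; here they stay random.
embedding_table = rng.normal(size=(vocab_size, d))

token_ids = np.array([0, 1, 2, 3])   # row indices in the toy vocabulary
x = embedding_table[token_ids]       # row lookup, no arithmetic involved

print(x.shape)  # (4, 8)
```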

This idea became famous through word2vec. The training goal was not to manually teach the model concepts like gender, grammar, or analogy. The goal was much simpler: predict which words appear near a given word.

Take this sentence:

The cat sat on the mat.

If the model sees cat, it should assign higher probability to nearby words like the, sat, and maybe mat than to unrelated words. Do this over a huge amount of text, and the model starts learning that words appearing in similar neighborhoods often have related meanings.
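One way to picture the training data is as (center word, nearby word) pairs. The sketch below generates such pairs in a skip-gram style; the function name and the window size of 2 are assumptions for illustration, not word2vec's actual code.

```python
def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Return (center, context) pairs for every token and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the cat sat on the mat".split()
cat_pairs = [p for p in skipgram_pairs(tokens) if p[0] == "cat"]
print(cat_pairs)  # [('cat', 'the'), ('cat', 'sat'), ('cat', 'on')]
```

With this window, mat falls outside the neighborhood of cat; a wider window would include it.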

That is why examples like this became so striking:

king - man + woman ~= queen

Embedding arithmetic

Vector directions can carry relationships

[Figure: a vector addition diagram over two dimensions (dim 1, dim 2). The path king − man + woman ends close to, but not exactly on, the queen vector.]

The model was not explicitly told "king is to man as queen is to woman." That rough structure emerged because the vectors were optimized to be useful for prediction.
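The arithmetic itself is easy to try. The sketch below uses hand-picked 2-D vectors, which are an assumption made purely for illustration: learned embeddings have hundreds of dimensions and are not this tidy.

```python
import numpy as np

# Hand-picked 2-D toy vectors (hypothetical, not learned embeddings).
vectors = {
    "king":  np.array([0.90, 0.85]),
    "queen": np.array([0.88, 0.12]),
    "man":   np.array([0.10, 0.90]),
    "woman": np.array([0.12, 0.10]),
    "apple": np.array([0.50, 0.50]),
}


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


target = vectors["king"] - vectors["man"] + vectors["woman"]

# Exclude the three input words, as analogy evaluations usually do.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)  # queen
```

The result lands near queen without matching it exactly, which is all the ~= in the formula claims.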

One caveat: individual dimensions usually do not map cleanly to human concepts. Dimension 1 is not "gender", dimension 2 is not "color", and dimension 3 is not "animal-ness". In large embeddings, meaning is usually distributed across many dimensions at once.

Embedding space

Random IDs become useful neighborhoods

[Figure: a scatter plot in which word vectors start out randomly scattered before training and move into semantic clusters as training progresses. With training, words for similar concepts end up together.]

At this point each token has a vector, but each vector is still mostly context-free. The embedding for bird represents a broad idea of bird. It does not yet know whether this bird is red, flying, extinct, angry, or part of a company name.

That is where attention enters.

Attention Adds Context

Take the sentence:

the bird is red

After tokenization and embedding, the model has one vector for each token:

the   -> vector
bird  -> vector
is    -> vector
red   -> vector

But the token bird should not stay generic. In this sentence, it should absorb information from red. The representation we want is closer to:

bird = bird, but informed by "red"

A naive version would be:

bird_context =
  some_amount_of("the") +
  some_amount_of("bird") +
  some_amount_of("is") +
  some_amount_of("red")

That is Attention in plain English. Each token builds a weighted mixture of other token vectors. High weight means "this token matters to me right now." Low weight means "ignore most of this."

[Figure: a vector diagram in three steps. Start with the bird vector, add the red context vector, and arrive at a new context-aware "red bird" vector.]

The result is a new vector for every token. Not just bird, but also the, is, and red. Every position gets rewritten as a context-aware representation.

The 3 Lines

Here is the core attention operation in three lines:

Here x contains the embedded vectors from the previous step, one row per token, and d is the embedding dimension.

scores = (x @ x.T) / math.sqrt(d)
weights = masked_softmax(scores)
context = weights @ x

That is the heart of it.
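To run the three lines, masked_softmax needs a definition. The version below is a minimal NumPy sketch under stated assumptions: a causal mask, random vectors standing in for real embeddings, and toy sizes of 4 tokens and 8 dimensions.

```python
import math
import numpy as np


def masked_softmax(scores: np.ndarray) -> np.ndarray:
    """Apply a causal mask, then a row-wise softmax."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal
    scores = np.where(future, -np.inf, scores)            # hide future tokens
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
n, d = 4, 8                    # 4 tokens, 8-dim embeddings (toy sizes)
x = rng.normal(size=(n, d))    # stand-in for the embedded tokens

scores = (x @ x.T) / math.sqrt(d)
weights = masked_softmax(scores)
context = weights @ x

print(weights.round(2))   # lower-triangular, each row sums to 1
print(context.shape)      # (4, 8)
```

The first token can only attend to itself, so its context vector equals its own embedding.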

The dot product x @ x.T measures similarity between every pair of tokens. If two token vectors point in similar directions, their dot product is larger. Larger score means the model should pay more attention.

Attention weights

Softmaxed weights with causal masking

Attention weight matrix (rows are queries, columns are keys; – marks a causally masked cell):

        the    cat    is     here
the     1.00   –      –      –
cat     0.35   0.65   –      –
is      0.20   0.25   0.55   –
here    0.15   0.20   0.10   0.55

The division by sqrt(d) keeps the scores from becoming too large as vector size grows. Without that scaling, softmax can become too sharp too early, where one token gets almost all the weight and the others vanish.
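A quick experiment makes the scaling argument concrete. Assuming vector components drawn independently from a standard normal distribution (an idealization, not a property of real embeddings), the spread of raw dot products grows like sqrt(d), while the scaled version stays near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 2000  # number of random vector pairs per dimension (arbitrary)

spread = {}
for d in (16, 256, 1024):
    q = rng.normal(size=(samples, d))
    k = rng.normal(size=(samples, d))
    dots = (q * k).sum(axis=1)   # one dot product per pair
    spread[d] = (dots.std(), (dots / np.sqrt(d)).std())
    print(d, spread[d])          # (raw spread, scaled spread)
```

Larger raw scores push softmax toward a one-hot output, which is the "too sharp" behavior the scaling is meant to avoid.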

masked_softmax does two things.

First, the mask prevents cheating. During text generation, token 3 is allowed to look at tokens 1, 2, and 3, but not token 4. Future tokens do not exist yet.

Second, softmax converts raw scores into weights that add up to 1:

raw scores -> attention weights

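Then the final line computes the weighted sum:

context = weights @ x

Each row of weights says how much that token should borrow from every other token's vector. In this simplified version the value vectors are the embeddings x themselves; real transformers first project x into separate value vectors v.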

Context vectors

Each row produces a weighted sum of embeddings

[Figure: context vector formation. Each row of the attention weight matrix gives a weighted sum of embeddings. For the first row, the' = 1.00 · the; for the second, cat' = 0.35 · the + 0.65 · cat.]

This is why attention is often described as "weighted lookup" or "soft lookup." It does not choose exactly one previous token. It blends information from many tokens in different amounts.

What The MLP Does

Attention only mixes information across tokens. The MLP does much of the remaining heavy lifting: it transforms each mixed vector into something useful for predicting the next token. Suppose you ask ChatGPT:

What day was Michael Jackson born

Attention cannot supply the answer, because the prompt does not contain it. So how does the LLM know the date?

That information is stored in the model's weights, and much of it is thought to live in the MLP layers.

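You can think of the two parts like this: attention gathers information from other positions, and the MLP transforms what was gathered at each position.

A transformer stacks many such blocks, so the pattern repeats: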

attention -> MLP -> attention -> MLP -> ...

Early layers might capture simple local patterns. Later layers can build more abstract representations. The exact interpretation is messy, but the shape is useful: attention moves information between positions, and MLPs process the information at each position.

Eventually, the model takes the final vector at the current position and turns it into a score for every token in the vocabulary. These scores are called logits.
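The MLP and the final projection can be sketched as follows. The sizes (8, 16, 8, and a 7-token vocabulary), the SiLU activation, and the random weights are assumptions chosen for illustration; residual connections and layer normalization are left out to keep the shape of the computation visible.

```python
import numpy as np


def silu(z: np.ndarray) -> np.ndarray:
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
d, hidden, vocab_size = 8, 16, 7   # toy sizes, assumed for illustration

# Random stand-ins for learned weights.
W1, b1 = rng.normal(size=(d, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, d)) * 0.1, np.zeros(d)
W_vocab = rng.normal(size=(d, vocab_size)) * 0.1


def mlp(h: np.ndarray) -> np.ndarray:
    """Two-layer feed-forward network, applied to each position on its own."""
    return silu(h @ W1 + b1) @ W2 + b2


context = rng.normal(size=(4, d))   # stand-in for attention output, one row per token
hidden_out = mlp(context)
logits = hidden_out[-1] @ W_vocab   # only the last position predicts the next token

print(logits.shape)  # (7,)
```

Unlike attention, the MLP never looks across rows: running it on one token alone gives the same result as running it on the whole sequence.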

Feed-forward → projection

Context vectors through the MLP and into logits

[Figure: MLP and logit projection. The 8-dimensional context vectors the', cat', is', here' pass through a 2-layer MLP with a SiLU activation (input 8, hidden 16, output 8). A linear layer then projects the result to logits over the vocabulary (the, cat, is, here, sat, on, mat, ...), followed by an argmax.]

After softmax, logits become probabilities:

P("bird") = 0.02
P("red")  = 0.01
P(".")    = 0.31
...

The model is not directly writing English. It is repeatedly producing a probability distribution over the next token.
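Softmax itself is short. Here is a minimal sketch in plain Python, with made-up logits for a three-token vocabulary; subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]   # hypothetical scores for three tokens
probs = softmax(logits)
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
```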

Sampling The Next Token

Once the model has probabilities, we need to choose a token.

The simplest choice is greedy decoding: pick the highest-probability token every time. That is deterministic, but it can become dull or get stuck in repetitive patterns.

Other strategies add controlled randomness, for example temperature scaling, top-k sampling, and top-p (nucleus) sampling.
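A minimal sketch of greedy decoding next to temperature sampling, using only the standard library. The function names and the logits are hypothetical; a temperature below 1 sharpens the distribution toward the greedy choice, and a temperature above 1 flattens it.

```python
import math
import random


def greedy(logits: list[float]) -> int:
    """Pick the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits: list[float], temperature: float = 1.0, rng=random) -> int:
    """Sample an index from softmax(logits / temperature); temperature must be > 0."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]   # unnormalized probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1]   # hypothetical scores for a 3-token vocabulary
rng = random.Random(0)     # seeded so runs are repeatable

print(greedy(logits))      # 0
draws = [sample(logits, temperature=1.0, rng=rng) for _ in range(1000)]
print([draws.count(i) for i in range(3)])   # counts per token
```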

Sampling

Pick a token from the logit distribution

[Figure: sampling from logits. Each candidate token in the logit distribution over the vocabulary is highlighted in turn, then one is selected as the sampled token.]

After a token is sampled, it is appended to the input. Then the model runs again to predict the next token. Then again. Then again.

That is autoregressive generation:

input -> predict one token -> append it -> predict one token -> append it
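The loop can be sketched with a stand-in model. Here toy_model is a hypothetical function that always favors "the next ID, wrapping around" over a 5-token vocabulary; a real model would run the full transformer at this step. Greedy decoding is used to keep the output deterministic.

```python
VOCAB_SIZE = 5  # hypothetical tiny vocabulary


def toy_model(tokens: list[int]) -> list[float]:
    """Stand-in for a transformer: returns one logit per vocabulary entry."""
    logits = [0.0] * VOCAB_SIZE
    logits[(tokens[-1] + 1) % VOCAB_SIZE] = 1.0   # favor the "next" ID
    return logits


def generate(model, prompt: list[int], max_new_tokens: int) -> list[int]:
    """Predict one token, append it, and repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = model(tokens)                                     # run the model
        next_token = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        tokens.append(next_token)                                  # feed it back in
    return tokens


output = generate(toy_model, prompt=[0, 1], max_new_tokens=4)
print(output)  # [0, 1, 2, 3, 4, 0]
```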

End to end · autoregressive loop

Sample, append, repeat

[Figure: the looped transformer pipeline. Stages: 1 tokens (the cat is here), 2 embeddings (8-dim vectors), 3 attention (softmaxed weights × embeddings → context vectors), 4 MLP (in, SiLU, out), 5 logits, 6 sampled token. The sampled token is appended and the computation runs again.]

Two Missing Details

There are two details worth adding before the picture feels complete.

First, attention by itself does not know word order. If you only compare token vectors, the set ["the", "bird", "is", "red"] looks too much like ["red", "is", "bird", "the"]. Transformers add positional information to the token embeddings so the model can tell where each token appears.
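One well-known scheme for the positional information described above is the sinusoidal encoding from the original Transformer paper, sketched below. It is only one option (many models use learned or rotary position schemes instead), and the embedding table and sizes here are made up for illustration.

```python
import numpy as np


def sinusoidal_positions(n: int, d: int) -> np.ndarray:
    """Sinusoidal position vectors; d is assumed to be even."""
    positions = np.arange(n)[:, None]                   # (n, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))   # (d/2,)
    angles = positions * freqs                          # (n, d/2)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe


rng = np.random.default_rng(0)
d = 8
table = rng.normal(size=(4, d))      # hypothetical embedding table
token_ids = np.array([0, 1, 2, 0])   # token 0 appears at positions 0 and 3

x = table[token_ids]
x_pos = x + sinusoidal_positions(len(token_ids), d)

print(np.allclose(x[0], x[3]))          # True: same token, same embedding
print(np.allclose(x_pos[0], x_pos[3]))  # False: position now distinguishes them
```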

Second, real transformers do this attention operation many times in parallel. This is called multi-head attention. Each head has its own learned query, key, and value projections, so different heads can specialize in different relationships. One head might track nearby syntax. Another might connect names to pronouns. Another might focus on punctuation or formatting.

The key idea does not change:

compare tokens -> make weights -> mix values
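A rough NumPy sketch of multi-head attention follows. The weight matrices are random stand-ins for learned projections, the sizes are toy values, and names such as multi_head_attention and Wo are chosen here for illustration rather than taken from any particular library.

```python
import math
import numpy as np


def causal_softmax(scores: np.ndarray) -> np.ndarray:
    """Mask future positions, then softmax over the last axis."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)


def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    n, d = x.shape
    d_head = d // n_heads   # assumes d is divisible by n_heads

    def split(m):           # (n, d) -> (n_heads, n, d_head)
        return m.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / math.sqrt(d_head)   # compare tokens
    weights = causal_softmax(scores)                         # make weights
    heads = weights @ v                                      # mix values
    merged = heads.transpose(1, 0, 2).reshape(n, d)          # concatenate heads
    return merged @ Wo


rng = np.random.default_rng(0)
n, d, n_heads = 4, 8, 2
x = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))

out = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (4, 8)
```

Each head runs the same three steps on its own slice of the projected vectors, and the final matrix Wo mixes the heads back together.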

Why Scale Matters

The transformer recipe is simple, but scale changes what it can learn. A tiny transformer may learn local grammar. A large one trained on enough data can learn facts, style, code patterns, reasoning traces, translation behavior, and many other regularities in text.

GPT-2 had 1.5 billion parameters. GPT-3 had 175 billion. The GPT-3 paper, "Language Models are Few-Shot Learners", showed that a sufficiently large language model could perform many tasks directly from the prompt, without task-specific fine-tuning.

That does not mean the model was explicitly taught a general reasoning algorithm. It means next-token prediction, at enough scale, forced the model to learn internal representations that are useful for many behaviors.

As Max Leiter puts it in Weights:

They are made out of weights? Yes.

That is the unsettling and beautiful part. The model is "just" matrices, vectors, nonlinearities, and probabilities. But those weights encode a huge amount of structure about language.

Ending Thought

Attention is not the whole transformer, and transformers are not the whole story of modern LLMs. There are tokenizers, positional encodings, layer normalization, residual connections, MLPs, optimizers, datasets, sampling tricks, safety layers, caches, and a lot of engineering.

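But the center is still surprisingly compact. The three lines are not the whole model, but they are the part that makes the model context-aware. With learned query, key, and value projections (q, k, v) in place of the raw embeddings x, and d_k as the key dimension, they read: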

scores = (q @ k.T) / math.sqrt(d_k)
weights = masked_softmax(scores)
context = weights @ v

That is attention: every token asking, "given where I am, which other tokens matter?"

Complete code, with comments: gitlab

Tags: #attention #transformer #LLM