Pemberton, BC

Attention in 3 lines

"They're made out of weights."

"Weights?"

"Weights. Floating-point numbers. We checked the whole thing through. It's nothing but weights."

"Weights doing what? Where do the words come from?"

Short excerpt from https://maxleiter.com/blog/weights

LLMs can feel mysterious because the finished systems are enormous. The basic transformer idea is smaller than that: turn text into numbers, let every token look at the tokens before it, and use the result to predict what comes next. A lot of an LLM's capabilities emerge from the scale of training it undergoes; see Why Scale Matters below.

The important word in that description is look. Attention is the part of the transformer that decides which previous tokens are useful for understanding the current token, that is, which tokens to pay more attention to.

Text Becomes Tokens

Computers do not see words the way we do. A model receives numbers, so the first step is tokenization.

Tokenization splits text into smaller pieces called tokens. A token might be a word, part of a word, punctuation, whitespace, or some repeated text fragment. Each token gets an integer ID.

For example:

"the bird is red" -> ["the", "bird", "is", "red"]

Those token strings are then mapped to IDs:

["the", "bird", "is", "red"] -> [1820, 10214, 374, 2579]

The exact IDs do not matter here. They are just labels. If bird is token 10214, that number does not contain bird-ness. It is no more meaningful than a row number in a spreadsheet.
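The two mappings above can be sketched in a few lines. This is a toy illustration, not a real tokenizer: it splits on whitespace, whereas real tokenizers usually split into subword pieces. The names TOY_VOCAB, tokenize, and encode are made up for this example, and the IDs are simply copied from the example above.

```python
# Toy whitespace tokenizer (an assumption for illustration; real tokenizers
# use learned subword vocabularies). IDs are copied from the example above.
TOY_VOCAB = {"the": 1820, "bird": 10214, "is": 374, "red": 2579}

def tokenize(text):
    """Split text into token strings. Here: plain whitespace splitting."""
    return text.split()

def encode(text, vocab=TOY_VOCAB):
    """Map each token string to its integer ID."""
    return [vocab[token] for token in tokenize(text)]

tokens = tokenize("the bird is red")
ids = encode("the bird is red")
print(tokens)  # ['the', 'bird', 'is', 'red']
print(ids)     # [1820, 10214, 374, 2579]
```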

[Figure: Tokenization → Embedding. Words become vectors.]

Tokens Become Vectors

Token IDs become useful only after we map them into vectors. These vectors are called embeddings.

An embedding is a list of numbers learned during training. Instead of treating a token as an isolated dictionary entry, the model stores it as a point in a learned space. Tokens used in similar contexts tend to end up near each other.

This idea became famous through word2vec. The training goal was not to manually teach the model concepts like gender, grammar, or analogy. The goal was much simpler:

given a word, predict nearby words
or given nearby words, predict the missing word

Take this sentence:

The cat sat on the mat.

If the model sees cat, it should assign higher probability to nearby words like the, sat, and maybe mat than to unrelated words. Do this over a huge amount of text, and the model starts learning that words appearing in similar neighborhoods often have related meanings.

That is why examples like this became so striking:

king - man + woman ~= queen

[Figure: Embedding arithmetic. Vector directions can carry relationships.]

The model was not explicitly told "king is to man as queen is to woman." That rough structure emerged because the vectors were optimized to be useful for prediction.
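To make the arithmetic concrete, here is a small sketch with hand-made 3-dimensional vectors. These are not learned embeddings: the numbers are invented so that the analogy works out, and the names toy and cosine are hypothetical. As is common in analogy demos, the three query words are excluded from the candidates.

```python
import numpy as np

# Hand-made vectors chosen for illustration only (NOT learned embeddings).
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
    "apple": np.array([0.0, 0.1, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = toy["king"] - toy["man"] + toy["woman"]

# Search the remaining words for the one closest in direction to the target.
candidates = [w for w in toy if w not in ("king", "man", "woman")]
nearest = max(candidates, key=lambda w: cosine(target, toy[w]))
print(nearest)  # queen
```

With real embeddings the match is only approximate, which is why the relation above is written with ~= rather than =.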

One caveat: individual dimensions usually do not map cleanly to human concepts. Dimension 1 is not "gender", dimension 2 is not "color", and dimension 3 is not "animal-ness". In large embeddings, meaning is usually distributed across many dimensions at once.

[Figure: Embedding space. Random IDs become useful neighborhoods. Panels: before training; with training, words for similar concepts end up together.]

At this point each token has a vector, but each vector is still mostly context-free. The embedding for bird represents a broad idea of bird. It does not yet know whether this bird is red, flying, extinct, angry, or part of a company name.

That is where attention enters.

Attention Adds Context

Take the sentence:

the bird is red

After tokenization and embedding, the model has one vector for each token:

the   -> vector
bird  -> vector
is    -> vector
red   -> vector
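In code, "one vector for each token" is usually just a row lookup in a table. The sketch below assumes a hypothetical vocabulary size and vector width, and fills the table with random numbers as a stand-in for weights that a real model would learn during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 20000, 8   # hypothetical sizes chosen for illustration

# Random stand-in for the learned embedding table: one row per token ID.
embedding_table = rng.normal(size=(vocab_size, d))

token_ids = np.array([1820, 10214, 374, 2579])  # "the bird is red"
x = embedding_table[token_ids]                   # row lookup, one vector per token

print(x.shape)  # (4, 8)
```

The matrix x, with one row per token, is the input that attention works on.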

But the token bird should not stay generic. In this sentence, it should absorb information from red. The representation we want is closer to:

bird = bird, but informed by "red"

A naive version would be:

bird_context =
  some_amount_of("the") +
  some_amount_of("bird") +
  some_amount_of("is") +
  some_amount_of("red")

That is attention in plain English. Each token builds a weighted mixture of token vectors, its own included. High weight means "this token matters to me right now." Low weight means "ignore most of this."

The result is a new vector for every token. Not just bird, but also the, is, and red. Every position gets rewritten as a context-aware representation.
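The naive version can be written out directly. In this sketch the weights for bird are picked by hand purely for illustration, the embeddings are random stand-ins, and causal masking is ignored; a real model computes the weights from the data.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["the", "bird", "is", "red"]
x = rng.normal(size=(4, 8))   # random stand-in embeddings, one row per token

# Hand-picked weights for "bird" (an assumption for illustration).
# They sum to 1, and most of the weight goes to "bird" itself and "red".
bird_weights = np.array([0.05, 0.50, 0.05, 0.40])

# Weighted sum of all four token vectors.
bird_context = bird_weights @ x

print(bird_context.shape)  # (8,)
```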

The 3 Lines

Here is the core attention operation in three lines:

Here x contains the embedded vectors from the previous step, one row per token:

scores = (x @ x.T) / math.sqrt(d)
weights = masked_softmax(scores)
context = weights @ x

That is the heart of it.

The dot product x @ x.T measures similarity between every pair of tokens, giving one score per pair. If two token vectors point in similar directions, their dot product is larger, and a larger score means the model should pay more attention.

[Figure: Attention weights. Softmaxed weights with causal masking, shaded from low to high; – marks causal-masked entries.]

The division by sqrt(d) keeps the scores from becoming too large as vector size grows. Without that scaling, softmax can become too sharp too early, where one token gets almost all the weight and the others vanish.
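A quick experiment shows the effect. The sketch below assumes vectors with independent, unit-variance random entries, which is a simplification of real embeddings. Under that assumption the spread of raw dot products grows roughly like sqrt(d), and dividing by sqrt(d) brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
results = {}

for d in (16, 256, 1024):
    # 500 random vector pairs of width d (unit-variance entries, an assumption).
    a = rng.normal(size=(500, d))
    b = rng.normal(size=(500, d))
    raw = np.sum(a * b, axis=1)   # 500 dot products
    scaled = raw / np.sqrt(d)     # the same scaling used in the scores line
    results[d] = (raw.std(), scaled.std())
    print(f"d={d:5d}  raw std={raw.std():6.2f}  scaled std={scaled.std():.2f}")
```

The raw spread keeps growing with d, while the scaled spread stays near 1, so softmax sees inputs of a similar size regardless of vector width.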

masked_softmax does two things.

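First, the mask prevents cheating. During text generation, token 3 is allowed to look at tokens 1, 2, and 3, but not token 4. Future tokens do not exist yet. Under this mask, red can draw on bird, but bird cannot draw on red; the earlier bird example ignored the mask to keep the idea simple.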

Second, softmax converts raw scores into weights that add up to 1:

raw scores -> attention weights

Then the final line computes the weighted sum:

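context = weights @ x

Each row of weights says how much that token should borrow from every token vector. In this simplified version the vectors being mixed are the embeddings x themselves; real transformers mix learned value vectors instead, as described under Two Missing Details.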
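Putting the pieces together, here is a runnable sketch of the three lines. The masked_softmax below is one possible implementation, not necessarily the one the original code uses: it sets future positions to minus infinity and then applies a row-wise softmax. The embeddings are random stand-ins.

```python
import math
import numpy as np

def masked_softmax(scores):
    """Causal mask followed by a row-wise softmax (one possible implementation)."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    masked = np.where(future, -np.inf, scores)            # future tokens get -inf
    masked = masked - masked.max(axis=1, keepdims=True)   # stabilise before exp
    exp = np.exp(masked)                                   # exp(-inf) is exactly 0
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))   # random stand-in embeddings for 4 tokens

scores = (x @ x.T) / math.sqrt(d)
weights = masked_softmax(scores)
context = weights @ x

print(weights.round(2))   # lower-triangular; each row sums to 1
print(context.shape)      # (4, 8)
```

The first token can only look at itself, so its weight row is [1, 0, 0, 0] and its context vector equals its own embedding.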

[Figure: Context vectors. Each row produces a weighted sum of embeddings.]

This is why attention is often described as "weighted lookup" or "soft lookup." It does not choose exactly one previous token. It blends information from many tokens in different amounts.

What The MLP Does

Attention only mixes information across tokens. The MLP (feed-forward layer) does much of the remaining heavy lifting, transforming each token's context vector into features that help predict the next token. Suppose you ask ChatGPT:

What day was Michael Jackson born

The prompt does not contain the answer, so attention alone cannot produce it: attention can only move around information that is already in the input. So how does the LLM know the date?

That information is stored in the model's learned weights, and much of it is thought to live in the MLP layers.

You can think of the two parts like this:

attention asks: what other tokens should I use?
the MLP asks: now that I have that context, what features should I compute?

A transformer stacks many blocks, and each block repeats this pattern:

attention -> MLP -> attention -> MLP -> ...

Early layers might capture simple local patterns. Later layers can build more abstract representations. The exact interpretation is messy, but the shape is useful: attention moves information between positions, and MLPs process the information at each position.
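"MLPs process the information at each position" can be sketched as two matrix multiplications with a nonlinearity in between, applied to every row independently. The sizes, the random weights, and the choice of ReLU are assumptions for illustration; real models learn the weights and often use other activations such as GELU.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, d_hidden = 4, 8, 32   # hypothetical sizes

# Random stand-ins for learned weights.
W1, b1 = rng.normal(size=(d, d_hidden)) * 0.1, np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d)) * 0.1, np.zeros(d)

def mlp(h):
    """Expand to d_hidden, apply ReLU, project back to d. Works row by row."""
    return np.maximum(0.0, h @ W1 + b1) @ W2 + b2

context = rng.normal(size=(n_tokens, d))   # stand-in for attention output
out = mlp(context)

print(out.shape)  # (4, 8)
```

Unlike attention, nothing here lets one row see another: feeding a single row through mlp gives the same result as feeding it through with the others.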

Eventually, the model takes the final vector at the current position and turns it into a score for every token in the vocabulary. These scores are called logits.

[Figure: Feed-forward → projection. Context vectors through the MLP and into logits.]

After softmax, logits become probabilities:

P("bird") = 0.02
P("red")  = 0.01
P(".")    = 0.31
...

The model is not directly writing English. It is repeatedly producing a probability distribution over the next token.
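The step from final vector to probabilities can be sketched as a matrix multiplication followed by softmax. The vocabulary size, the projection matrix W_out, and the final vector are random stand-ins here, so the resulting probabilities are meaningless; only the shapes and the normalisation are the point.

```python
import numpy as np

def softmax(logits):
    """Turn arbitrary scores into positive numbers that sum to 1."""
    shifted = logits - np.max(logits)   # subtracting the max avoids overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
d, vocab_size = 8, 50                   # hypothetical sizes

h_final = rng.normal(size=d)                # stand-in for the final vector
W_out = rng.normal(size=(d, vocab_size))    # stand-in for the learned projection

logits = h_final @ W_out    # one score per token in the vocabulary
probs = softmax(logits)     # one probability per token

print(logits.shape, probs.shape)  # (50,) (50,)
```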

Sampling The Next Token

Once the model has probabilities, we need to choose a token.

The simplest choice is greedy decoding: pick the highest-probability token every time. That is deterministic, but it can become dull or get stuck in repetitive patterns.

Other sampling strategies add controlled randomness:

temperature makes the distribution sharper or flatter
top-k samples only from the k most likely tokens
top-p samples from the smallest group of tokens whose probabilities add up to a chosen threshold
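The three strategies can be combined in one small function. This is a sketch rather than any particular library's implementation: the function name sample and its parameters are hypothetical, top_k is assumed to be at least 1, and real libraries differ in details such as tie handling.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def sample(logits, rng, temperature=1.0, top_k=None, top_p=None):
    """Sample one token ID using temperature, then optional top-k and top-p."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    order = np.argsort(probs)[::-1]          # token IDs, most likely first
    sorted_probs = probs[order]
    keep = np.ones(len(probs), dtype=bool)   # mask over the sorted order
    if top_k is not None:
        keep[top_k:] = False                 # drop everything after the k-th
    if top_p is not None:
        cumulative = np.cumsum(sorted_probs)
        # Keep tokens up to and including the one that reaches the threshold.
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[cutoff:] = False
    filtered = np.where(keep, sorted_probs, 0.0)
    filtered = filtered / filtered.sum()     # renormalise what is left
    return int(order[rng.choice(len(filtered), p=filtered)])

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1, -1.0]               # made-up scores for 4 tokens
print(sample(logits, rng, top_k=1))          # 0, same as greedy decoding
print(sample(logits, rng, temperature=0.8, top_k=3, top_p=0.9))
```

Setting top_k=1 reduces to greedy decoding, since only the single most likely token survives the filter.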

[Figure: Sampling. Pick a token from the logit distribution.]

After a token is sampled, it is appended to the input. Then the model runs again to predict the next token. Then again. Then again.

That is autoregressive generation:

input -> predict one token -> append it -> predict one token -> append it
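The loop itself is short. In the sketch below, toy_model is a made-up stand-in for a transformer: it returns logits that always favour the ID one higher than the last token. Greedy decoding is used to keep the output predictable.

```python
import numpy as np

VOCAB_SIZE = 5  # hypothetical tiny vocabulary

def toy_model(token_ids):
    """Stand-in for a transformer: favours (last token ID + 1) mod VOCAB_SIZE."""
    logits = np.zeros(VOCAB_SIZE)
    logits[(token_ids[-1] + 1) % VOCAB_SIZE] = 5.0
    return logits

def generate(prompt_ids, n_new_tokens):
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        logits = toy_model(ids)            # run the model on everything so far
        next_id = int(np.argmax(logits))   # greedy choice, for simplicity
        ids.append(next_id)                # append, then go around again
    return ids

print(generate([0, 1], 4))  # [0, 1, 2, 3, 4, 0]
```

Swapping the argmax for a sampling function gives the randomised variants described above; the loop stays the same.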

[Figure: End to end, autoregressive loop. Sample, append, repeat.]

Two Missing Details

There are two details worth adding before the picture feels complete.

First, attention by itself does not know word order. If you only compare token vectors, the set ["the", "bird", "is", "red"] looks too much like ["red", "is", "bird", "the"]. Transformers add positional information to the token embeddings so the model can tell where each token appears.
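The order-blindness is easy to check numerically. The sketch below uses unmasked attention on random stand-in embeddings: shuffling the tokens just shuffles the output rows, so each token ends up with exactly the same vector. Adding a position vector to each slot breaks that symmetry. The random pos matrix is an assumption standing in for whatever positional scheme a real model uses.

```python
import math
import numpy as np

def softmax_rows(s):
    s = s - s.max(axis=1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)

def unmasked_attention(x):
    """The three lines without the causal mask."""
    d = x.shape[1]
    return softmax_rows((x @ x.T) / math.sqrt(d)) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # stand-in embeddings for "the bird is red"
perm = [3, 2, 1, 0]           # "red is bird the"

# Without positions: each token gets the same vector in either order.
out = unmasked_attention(x)
out_shuffled = unmasked_attention(x[perm])
order_blind = np.allclose(out_shuffled, out[perm])

# With positions: slot i always gets pos[i], whichever token sits there.
pos = rng.normal(size=(4, 8))   # random stand-in for position vectors
out_pos = unmasked_attention(x + pos)
out_pos_shuffled = unmasked_attention(x[perm] + pos)
order_aware = not np.allclose(out_pos_shuffled, out_pos[perm])

print(order_blind, order_aware)  # True True
```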

Second, real transformers do this attention operation many times in parallel. This is called multi-head attention. Each head has its own learned query, key, and value projections, so different heads can specialize in different relationships. One head might track nearby syntax. Another might connect names to pronouns. Another might focus on punctuation or formatting.

The key idea does not change:

compare tokens -> make weights -> mix values
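A minimal multi-head sketch, assuming random matrices in place of the learned query, key, and value projections. The output projection Wo and the choice to split the projected vectors into equal column slices per head are common conventions assumed here, not something the text above specifies.

```python
import math
import numpy as np

def masked_softmax(scores):
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    d_model = x.shape[1]
    d_k = d_model // n_heads                  # width of each head
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # project once, then slice per head
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_k, (h + 1) * d_k)
        scores = (q[:, cols] @ k[:, cols].T) / math.sqrt(d_k)   # compare tokens
        weights = masked_softmax(scores)                         # make weights
        heads.append(weights @ v[:, cols])                       # mix values
    return np.concatenate(heads, axis=1) @ Wo  # join heads, mix them together

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 4, 8, 2           # hypothetical sizes
x = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (4, 8)
```

Each head runs the same three lines on its own slice, so two heads can put their weight on entirely different tokens.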

Why Scale Matters

The transformer recipe is simple, but scale changes what it can learn. A tiny transformer may learn local grammar. A large one trained on enough data can learn facts, style, code patterns, reasoning traces, translation behavior, and many other regularities in text.

GPT-2 had 1.5 billion parameters. GPT-3 had 175 billion. The paper "Language Models are Few-Shot Learners" showed that a sufficiently large language model could perform many tasks directly from the prompt, without task-specific fine-tuning.

That does not mean the model was explicitly taught a general reasoning algorithm. It means next-token prediction, at enough scale, forced the model to learn internal representations that are useful for many behaviors.

As Max Leiter puts it in Weights:

They are made out of weights? Yes.

That is the unsettling and beautiful part. The model is "just" matrices, vectors, nonlinearities, and probabilities. But those weights encode a huge amount of structure about language.

Ending Thought

Attention is not the whole transformer, and transformers are not the whole story of modern LLMs. There are tokenizers, positional encodings, layer normalization, residual connections, MLPs, optimizers, datasets, sampling tricks, safety layers, caches, and a lot of engineering.

But the center is still surprisingly compact:

turn text into tokens
turn tokens into vectors
compare tokens with attention
use the attention weights to build context vectors
turn the final vector into next-token probabilities
sample a token
repeat

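The three lines are not the whole model, but they are the part that makes the model context-aware. Written with the learned query, key, and value projections q, k, and v in place of the raw embeddings x, they are: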

scores = (q @ k.T) / math.sqrt(d_k)
weights = masked_softmax(scores)
context = weights @ v

That is attention: every token asking, "given where I am, which other tokens matter?"

Complete Code along with comments: gitlab