Transformers kicked off the AI boom of recent years.
Interestingly, the paper that started it all, “Attention Is All You Need”, was almost overlooked by most of the research community.
Like many of Deep Learning’s revolutions, transformers simply applied a few existing concepts in a new way. Transformers are simple.
From a bird’s-eye view, they follow the architecture of autoencoders. They have an Encoder and a Decoder.
The encoder takes an input (tokens) and encodes them into an embedding space. The decoder takes its own input, as well as the encoder’s output, and decodes them into predictions. What makes transformers different is the way they encode and decode with attention. You will learn how this works below.
Attention
The Goal of Attention
The goal of attention is to figure out how much attention one part of the input (like a word or token) should pay to other parts (its context) to solve a task. It dynamically computes relationships between inputs, enabling the model to focus on relevant information in different contexts.
Example: In “Apple released a new iPhone,” the word “Apple” relates more to “phone” than to “fruit.” But in “I ate an apple,” it relates more to “fruit.” The attention mechanism adjusts the focus dynamically based on context.
Embedding Vectors
Because computers work with numbers, we represent words as vectors in a numeric space. This process is called embedding.
If the embedding space has dimensionality two, each word has two coordinates that place it in this space.
The goal of embeddings is to place similar words close to each other and dissimilar ones farther apart.
Visual Idea: Draw a 2D coordinate plane with points for “cat,” “dog” (close together), and “apple” (farther away, but near “fruit”). Label these points.
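To make this concrete, here is a tiny numpy sketch with hand-picked 2D coordinates for a few words (the numbers are entirely made up for illustration; in a real model the coordinates are learned):

```python
import numpy as np

# Hand-picked 2D coordinates, made up purely for illustration.
embeddings = {
    "cat":   np.array([0.9, 0.8]),
    "dog":   np.array([0.8, 0.9]),
    "fruit": np.array([0.2, 0.6]),
    "apple": np.array([0.1, 0.7]),
}

def cosine_similarity(a, b):
    # Close to 1.0 when two vectors point in a similar direction.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))      # high: similar words
print(cosine_similarity(embeddings["apple"], embeddings["fruit"]))  # high: similar words
print(cosine_similarity(embeddings["cat"], embeddings["apple"]))    # lower: dissimilar words
```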
Why Not Just Use Raw Embeddings?
Input embeddings (like word2vec or GloVe) are static. They represent a word’s general meaning but don’t adapt to context. For example:
“Apple” always has the same embedding, whether it refers to a fruit or a company.
Attention mechanisms overcome this limitation by dynamically transforming inputs into representations that adapt to the context.
The attention mechanism
We start with the embedding matrix $X$.
1. Input Transformations
We transform each input vector into three new representations.
- Query ($Q$): Represents what this input is “looking for.”
- Key ($K$): Represents what information this input “offers.”
- Value ($V$): Represents the actual content of the input.
We multiply $X$ by the weight matrices $W_Q$, $W_K$, $W_V$ to get $Q$, $K$, $V$.
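A minimal numpy sketch of step 1, with toy sizes and random (untrained) weights chosen just to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 2, 2   # toy sizes, chosen arbitrarily

X = rng.normal(size=(n_tokens, d_model))   # embedding matrix: one row per token

# The weight matrices are learned during training; random values stand in here.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

Q = X @ W_Q   # what each token is "looking for"
K = X @ W_K   # what each token "offers"
V = X @ W_V   # the actual content of each token

print(Q.shape, K.shape, V.shape)   # (4, 2) (4, 2) (4, 2)
```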
2. Compute Attention Scores:
Compare the Query vector of one input with the Key vectors of all other inputs using a similarity measure (like a dot product): $S = QK^\top$.
Side note: in reality we also scale this by $\sqrt{d_k}$, but I chose to keep it simple here. We then apply the softmax function, which makes sure each row of $A = \mathrm{softmax}(S)$ sums to 1.
Inside $A$ are now our attention scores, $a_{ij}$; they tell the model how much focus input $i$ should place on each other input $j$.
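Continuing the numpy sketch from step 1 (it reuses $Q$ and $K$ from there), step 2 compares every Query with every Key and normalizes the result; the scaling mentioned in the side note is shown as a comment:

```python
def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; each row then sums to 1.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

S = Q @ K.T   # similarity of every Query with every Key
# Scaled version from the paper: S = Q @ K.T / np.sqrt(d_head)
A = softmax(S)   # attention scores: row i says how much token i attends to each token j

print(A.sum(axis=-1))   # every row sums to 1
```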
3. Combine Values:
Use the attention scores to compute a weighted combination of the Value vectors: $Z = AV$.
Each resulting vector, $z_i$ (a row of $Z$), represents the original input transformed by its context.
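And step 3, still continuing the same sketch: the attention scores weight the Value vectors, giving each token a context-aware representation.

```python
Z = A @ V   # each row of Z is a weighted combination of the Value vectors

# The same thing written out for token 0: a weighted sum over all Value vectors.
z0 = sum(A[0, j] * V[j] for j in range(n_tokens))
print(np.allclose(Z[0], z0))   # True
```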
Why Use Values Instead of Raw Embeddings?
The Value (V) vectors are transformations of the original embeddings. We optimize the V vectors during training to contain the most task-relevant information.
By combining values with attention scores, we generate richer, more expressive representations that adapt to both the task and the context. $Z = \mathrm{softmax}(QK^\top)V$ is one attention head.
But we don’t have to stop at one.
Multiple Attention heads
The model can learn to pay attention to different aspects of each input. In text, for example, it could attend to grammar, tense (past or future), or relations between tokens. Attention enables a token’s representation to adapt based on the task and the context provided by other tokens. This flexibility is why attention mechanisms are so powerful for natural language tasks. We need different attention scores for each of those aspects. Achieving this is simple: we just create multiple attention heads and concatenate their outputs before feeding them into the traditional feed-forward network, as sketched below.
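A self-contained sketch of multi-head attention with toy sizes and random (untrained) weights; it only shows how the heads are computed independently and concatenated:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    # One head: project, score, softmax, combine (scaled dot-product attention).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(W_K.shape[1]))
    return A @ V

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(n_tokens, d_model))

heads = []
for _ in range(n_heads):
    # Each head gets its own projection weights, so it can learn to focus on a different aspect.
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))
    heads.append(attention_head(X, W_Q, W_K, W_V))

# Concatenate the heads back to d_model before the feed-forward network.
output = np.concatenate(heads, axis=-1)
print(output.shape)   # (4, 8)
```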
Positional Encoding
Note that attention loses all information on position.
To counter that, we create positional encodings.
Each token’s positional encoding uses the sine for its $i$-th dimension if $i$ is even, and the cosine otherwise.
This gives the lower dimensions high frequencies and the later dimensions progressively lower frequencies.
We also get relative positioning: tokens that are the same number of positions apart always relate to each other in the same way, because the functions are cyclical.
The encoding matrix $PE$ is added to the embedding matrix $X$ before we go through the steps above.
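A sketch of the sinusoidal encoding described above (the 10000 constant comes from the original paper; the toy sizes are arbitrary and d_model is assumed to be even):

```python
import numpy as np

def positional_encoding(n_tokens, d_model):
    # Sine on the even dimensions, cosine on the odd ones,
    # with a different frequency for every dimension pair.
    positions = np.arange(n_tokens)[:, None]     # 0, 1, 2, ...
    dims = np.arange(0, d_model, 2)[None, :]     # even dimension indices
    angles = positions / (10000 ** (dims / d_model))
    PE = np.zeros((n_tokens, d_model))
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

n_tokens, d_model = 6, 8
X = np.random.default_rng(0).normal(size=(n_tokens, d_model))   # toy embedding matrix
X = X + positional_encoding(n_tokens, d_model)                  # added before the attention steps above
print(X.shape)   # (6, 8)
```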
Decoder Differences
The decoder differs from the encoder in two ways.
What we’ve seen so far is called self-attention, because the inputs pay attention to themselves. The decoder also uses cross-attention. It takes an input $X_1$ and transforms it as before, but it also takes a second input $X_2$ (the encoder’s output): $X_1$ provides $Q$, while $X_2$ provides $K$ and $V$. The rest stays the same.

The second difference is that the decoder uses masked self-attention. We achieve the mask by setting the upper triangle of $S$ to negative infinity, which results in 0 attention on those positions after applying the softmax, as shown in the sketch below.
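Here is a sketch of the causal mask used in masked self-attention, assuming toy $Q$ and $K$ matrices with random values; everything above the diagonal of $S$ becomes negative infinity, which the softmax turns into 0:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_head = 4, 2
Q = rng.normal(size=(n_tokens, d_head))
K = rng.normal(size=(n_tokens, d_head))

S = Q @ K.T / np.sqrt(d_head)

# Mask: token i may only attend to tokens 0..i, so the upper triangle gets -inf.
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
S[mask] = -np.inf

A = softmax(S)
print(np.round(A, 2))   # the upper triangle is 0 after the softmax
```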