Transformers: Architecture, Operating Principles, and Application Examples

The transformer is a deep learning architecture proposed in the paper "Attention Is All You Need" (2017) for processing sequential data. Transformers have become the foundation of modern models in natural language processing (NLP), computer vision (e.g., ViT), and other fields.

Differences from Other Neural Networks:

Unlike recurrent neural networks (RNN, LSTM, GRU), transformers do not process data sequentially but analyze the entire sequence simultaneously using an attention mechanism. Unlike convolutional neural networks (CNN), transformers are not limited to local features and can account for dependencies between distant parts of the data. The core principle of transformers is the use of Self-Attention, which allows consideration of the context and importance of various elements in the input sequence.

The original transformer model consists of two main components: an encoder and a decoder. The encoder is responsible for "understanding" or "grasping the meaning" of the input text, while the decoder generates the output text.

 

Scaled Dot-Product Attention (SDPA)

Scaled Dot-Product Attention is the fundamental attention mechanism used in transformers. It calculates attention weights between elements based on the scaled dot product of Query and Key vectors, normalized with a softmax function, and then applies these weights to the Value.

\( \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V \)

Where:

  • Q — queries

  • K — keys

  • V — values

  • \( {d_{k}} \) — dimension of the keys
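
A minimal NumPy sketch of this formula (the shapes and values below are illustrative, not taken from a trained model):

# Scaled Dot-Product Attention, directly following the formula above
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of every query with every key
    weights = softmax(scores, axis=-1)     # attention weights, each row sums to 1
    return weights @ V                     # weighted sum of the values

# toy example: 2 tokens, d_k = 3
Q = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
K = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
V = np.array([[0.5, 1.0, 0.0], [1.0, 0.0, 0.5]])
print(scaled_dot_product_attention(Q, K, V))   # (2, 3) output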

Application of SDPA: Self-Attention vs Cross-Attention

| Characteristic | Self-Attention | Cross-Attention |
| --- | --- | --- |
| What it does | Each element looks at the same sequence | Each element looks at a different sequence |
| Where it's used | In both encoder and decoder | In the decoder (between the output and the encoded input) |
| Inputs to Q, K, V | Q = K = V = input sequence | Q = decoder; K = V = encoder output |
| Example | BERT, ViT | Encoder–decoder models (e.g., machine translation) |
| Result | Each word/patch considers the context of all others | Decoder receives information from the encoder (input sequence) |

Multi-Head Attention (MHA)

Multi-Head Attention is a mechanism that runs multiple SDPA instances, called "heads," in parallel and combines their results. Each head operates with its own linear projections of the input data and can focus on different aspects of the information.

If there are \( h \) attention heads, then:

\( \mathrm{MHA}(Q,K,V)=\operatorname{Concat}(\mathrm{head}_{1},\dots,\mathrm{head}_{h})\,W^{O} \)

where each head is:

\( \mathrm{head}_{i}=\mathrm{Attention}(Q\,W_{i}^{Q},\,K\,W_{i}^{K},\,V\,W_{i}^{V}) \)

• \( W_{i}^{Q},\,W_{i}^{K},\,W_{i}^{V} \) — trainable projection matrices for each head.

• \( W^{O} \) — matrix that combines all heads into a single output tensor.

Combining Heads and Returning to Original Dimension:

Each attention head returns a matrix of dimension \([N, d_{h}]\), where

• \( N \) — sequence length (e.g., number of tokens),

• \( d_{h}= \frac{d_{\mathrm{model}}}{h} \) — dimension of one head.

After computing all \( h \) heads, they are concatenated along the feature dimension:

\( \operatorname{Concat}(\mathrm{head}_{1},\dots,\mathrm{head}_{h}) \in \mathbb{R}^{N \times d_{\mathrm{model}}} \)

Then a linear transformation is applied via

\( W^{O} \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{model}}} \),

which returns the result to the original dimension.

Example (training): these weights receive gradients from the loss and are updated by an optimizer (e.g., Adam or SGD).

Thus \( W_{i}^{Q},\,W_{i}^{K},\,W_{i}^{V},\,W^{O} \) are trained like any other NN layer weights.
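
The mechanism can be sketched in a few lines of NumPy; the head count, dimensions, and random projection matrices below are illustrative stand-ins for trained weights:

# Multi-Head Attention: h parallel SDPA heads, concatenated and mixed by W_O
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: per-head projections of shape [d_model, d_h]; W_O: [h*d_h, d_model]
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O   # Concat(head_1, ..., head_h) W_O

rng = np.random.default_rng(0)
N, d_model, h = 2, 4, 2
d_h = d_model // h
W_Q = [rng.normal(size=(d_model, d_h)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_h)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_h)) for _ in range(h)]
W_O = rng.normal(size=(h * d_h, d_model))
X = rng.normal(size=(N, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # (2, 4)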

Example:

Step 1: Two Attention Heads (3×3 each)

Let’s assume we have two heads, each returning the following matrices after attention:

• Head 1 (3×3):

\( \mathrm{head}_{1}=\begin{bmatrix} 1&2&3\\ 4&5&6\\ 7&8&9 \end{bmatrix} \)

• Head 2 (3×3):

\( \mathrm{head}_{2}=\begin{bmatrix} 9&8&7\\ 6&5&4\\ 3&2&1 \end{bmatrix} \)

Step 2: Concatenation along the feature dimension

We concatenate the matrices along the width → resulting in one matrix of size 3×6:

\( \mathrm{Concat}=\begin{bmatrix} 1&2&3&9&8&7\\ 4&5&6&6&5&4\\ 7&8&9&3&2&1 \end{bmatrix} \quad (3\times6) \)

Step 3: Applying the matrix \( W^{O} \in \mathbb{R}^{6\times3} \)

The learnable matrix \( W^{O} \) reduces the size back to 3×3.

For example, let’s assume:

\( W^{O}=\begin{bmatrix} 1&0&0\\ 0&1&0\\ 0&0&1\\ 1&1&1\\ 1&1&1\\ 1&1&1 \end{bmatrix} \)

Step 4: Matrix Multiplication.

Compute \( \mathrm{Output}=\mathrm{Concat}\cdot W^{O} \). Each entry is a dot product \( (1\times6)\cdot(6\times1) \).

• Row 1 \( r_{1}=[1,\,2,\,3,\,9,\,8,\,7] \)

\( o_{11}=r_{1}\cdot \mathbf{c}_{1}=1\cdot1+2\cdot0+3\cdot0+9\cdot1+8\cdot1+7\cdot1=1+0+0+9+8+7=25, \)

\( o_{12}=r_{1}\cdot \mathbf{c}_{2}=1\cdot0+2\cdot1+3\cdot0+9\cdot1+8\cdot1+7\cdot1=0+2+0+9+8+7=26, \)

\( o_{13}=r_{1}\cdot \mathbf{c}_{3}=1\cdot0+2\cdot0+3\cdot1+9\cdot1+8\cdot1+7\cdot1=0+0+3+9+8+7=27. \)

• Row 2 \( r_{2}=[4,\,5,\,6,\,6,\,5,\,4] \)

\( o_{21}=r_{2}\cdot \mathbf{c}_{1}=4\cdot1+5\cdot0+6\cdot0+6\cdot1+5\cdot1+4\cdot1=4+0+0+6+5+4=19, \)

\( o_{22}=r_{2}\cdot \mathbf{c}_{2}=4\cdot0+5\cdot1+6\cdot0+6\cdot1+5\cdot1+4\cdot1=0+5+0+6+5+4=20, \)

\( o_{23}=r_{2}\cdot \mathbf{c}_{3}=4\cdot0+5\cdot0+6\cdot1+6\cdot1+5\cdot1+4\cdot1=0+0+6+6+5+4=21. \)

• Row 3 \( r_{3}=[7,\,8,\,9,\,3,\,2,\,1] \)

\( o_{31}=r_{3}\cdot \mathbf{c}_{1}=7\cdot1+8\cdot0+9\cdot0+3\cdot1+2\cdot1+1\cdot1=7+0+0+3+2+1=13, \)

\( o_{32}=r_{3}\cdot \mathbf{c}_{2}=7\cdot0+8\cdot1+9\cdot0+3\cdot1+2\cdot1+1\cdot1=0+8+0+3+2+1=14, \)

\( o_{33}=r_{3}\cdot \mathbf{c}_{3}=7\cdot0+8\cdot0+9\cdot1+3\cdot1+2\cdot1+1\cdot1=0+0+9+3+2+1=15. \)

Step 5: Final result

\( \mathrm{Output}=\begin{bmatrix} 25&26&27\\ 19&20&21\\ 13&14&15 \end{bmatrix} \).
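
The same worked example can be checked in NumPy:

# Verifying the worked example: concatenate two heads and apply W_O
import numpy as np

head_1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
head_2 = np.array([[9, 8, 7], [6, 5, 4], [3, 2, 1]])
W_O = np.array([[1, 0, 0],
                [0, 1, 0],
                [0, 0, 1],
                [1, 1, 1],
                [1, 1, 1],
                [1, 1, 1]])

concat = np.concatenate([head_1, head_2], axis=1)  # shape (3, 6)
output = concat @ W_O                              # shape (3, 3)
print(output)
# [[25 26 27]
#  [19 20 21]
#  [13 14 15]]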

| Matrix | What it does | Initialization | Updated during training? |
| --- | --- | --- | --- |
| \( W_{i}^{Q},\ W_{i}^{K},\ W_{i}^{V} \) | Build the Query, Key, and Value representations for each head | random | yes |
| \( W^{O} \) | Combines the outputs of all heads into a single tensor | random | yes |

During transformer training (via backpropagation):

• These weights receive gradients based on the loss,

• They are updated by the optimizer (e.g., Adam, SGD).

This means that the matrices \( W_{i}^{Q},\, W_{i}^{K},\, W_{i}^{V},\, W^{O} \) are trained just like any other neural network weights.


Example of Transformer Operation

Our goal is to translate "Hello World" from English to Spanish. The example "Hello World" will be split into tokens "Hello" and "World". Each token is assigned an ID in the model’s vocabulary. For example, "Hello" might be token 1, and "World" might be token 2.

1. Text Embedding

Token embeddings map a token ID to a fixed-length vector that carries the token's semantic meaning. This creates interesting properties: similar tokens will have similar embeddings (in other words, the cosine similarity between two embeddings gives a good sense of how similar the tokens are). All embeddings in a model have the same size. The original transformer paper used a size of 512, but to keep the computations manageable we will reduce this size to 4.

Let the vocabulary be as follows:

| Token | ID |
| --- | --- |
| Hello | 1 |
| World | 2 |

Now we assign the embeddings manually (normally they are learned, but here we set them by hand):

  • Hello (ID=1) → embedding: [0.1, 0.3, 0.5, 0.7]

  • World (ID=2) → embedding: [0.2, 0.4, 0.6, 0.8]
 

2. Positional Encoding (sin/cos)

Positions:

• “Hello” — position 0

• “World” — position 1

Embedding dimension \( d_{\text{model}}=4 \). We create positional codes for each position using:

\( \mathrm{PE}(pos,2i)=\sin\!\left(\dfrac{pos}{10000^{\,2i/d_{\text{model}}}}\right) \), \( \mathrm{PE}(pos,2i+1)=\cos\!\left(\dfrac{pos}{10000^{\,2i/d_{\text{model}}}}\right) \).

Manual calculation (rounded to two decimals):

For \( d_{\text{model}}=4 \) we have \( i\in\{0,1\} \), so the scales are \( 10000^{0}=1 \) and \( 10000^{0.5}=100 \).

• pos = 0

• \( \mathrm{PE}(0,0)=\sin(0/1)=0.00 \)

• \( \mathrm{PE}(0,1)=\cos(0/1)=1.00 \)

• \( \mathrm{PE}(0,2)=\sin(0/100)=0.00 \)

• \( \mathrm{PE}(0,3)=\cos(0/100)=1.00 \)

→ vector: [0.00, 1.00, 0.00, 1.00]

• pos = 1

• \( \mathrm{PE}(1,0)=\sin(1/1)\approx 0.84 \)

• \( \mathrm{PE}(1,1)=\cos(1/1)\approx 0.54 \)

• \( \mathrm{PE}(1,2)=\sin(1/100)\approx 0.01 \)

• \( \mathrm{PE}(1,3)=\cos(1/100)\approx 1.00 \) (exactly \( \cos(0.01)\approx 0.99995 \), which rounds to 1.00)

→ vector: [0.84, 0.54, 0.01, 1.00]

3. Adding embeddings and positional encoding

Now we simply add the corresponding elements:

  • Hello (position 0):
    [0.1, 0.3, 0.5, 0.7] + [0.00, 1.00, 0.00, 1.00] = [0.10, 1.30, 0.50, 1.70]

  • World (position 1):
    [0.2, 0.4, 0.6, 0.8] + [0.84, 0.54, 0.01, 1.00] = [1.04, 0.94, 0.61, 1.80]

Result after 3 steps:


| Token | Embedding | Positional encoding | Sum |
| --- | --- | --- | --- |
| Hello | [0.1, 0.3, 0.5, 0.7] | [0.00, 1.00, 0.00, 1.00] | [0.10, 1.30, 0.50, 1.70] |
| World | [0.2, 0.4, 0.6, 0.8] | [0.84, 0.54, 0.01, 1.00] | [1.04, 0.94, 0.61, 1.80] |
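
A short NumPy check of steps 1–3, using the manually chosen embeddings from above:

# Token embeddings + sinusoidal positional encoding for d_model = 4
import numpy as np

d_model = 4
embeddings = np.array([[0.1, 0.3, 0.5, 0.7],   # "Hello" (ID=1)
                       [0.2, 0.4, 0.6, 0.8]])  # "World" (ID=2)

def positional_encoding(num_positions, d_model):
    pe = np.zeros((num_positions, d_model))
    for pos in range(num_positions):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos, 2 * i] = np.sin(angle)       # even dimensions use sin
            pe[pos, 2 * i + 1] = np.cos(angle)   # odd dimensions use cos
    return pe

pe = positional_encoding(2, d_model)
print(np.round(pe, 2))                 # [[0. 1. 0. 1.], [0.84 0.54 0.01 1.]]
print(np.round(embeddings + pe, 2))    # [[0.1 1.3 0.5 1.7], [1.04 0.94 0.61 1.8]]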

4. Self-Attention

Now we introduce the concept of multi-head attention. Attention is a mechanism that lets the model focus on specific parts of the input. Multi-head attention allows the model to attend to information from different subspaces jointly by using multiple attention heads. Each head has its own matrices K, V, and Q.

In this example we will use two attention heads and assign arbitrary values to their weight matrices. Each matrix has shape 4×3, projecting the 4-dimensional embeddings into 3-dimensional queries (Q), keys (K), and values (V). This reduces the dimensionality of the attention mechanism and helps control computational cost; note that making this dimension too small can hurt model accuracy.

The input matrix is \( E \in \mathbb{R}^{2\times4} \):

\( E=\begin{bmatrix} 0.10&1.30&0.50&1.70\\ 1.04&0.94&0.61&1.80 \end{bmatrix} \)

The (arbitrary) head weights are:

Head 1:

\( W^{Q}_{1}=\begin{bmatrix} 0.2&-0.3&0.5\\ 0.7&0.2&-0.6\\ -0.1&0.4&0.3\\ 0.3&0.1&0.2 \end{bmatrix} \) \( W^{K}_{1}=\begin{bmatrix} 0.4&-0.2&0.1\\ 0.1&0.3&0.6\\ -0.5&0.2&-0.3\\ 0.2&0.1&0.4 \end{bmatrix} \) \( W^{V}_{1}=\begin{bmatrix} 0.3&0.5&0.1\\ -0.4&0.2&0.6\\ 0.2&0.1&-0.2\\ 0.7&-0.3&0.4 \end{bmatrix} \)

Head 2 (different weights):

\( W^{Q}_{2}=\begin{bmatrix} -0.2&0.6&0.1\\ 0.3&-0.1&0.5\\ 0.4&0.2&-0.3\\ 0.1&0.4&0.2 \end{bmatrix} \) \( W^{K}_{2}=\begin{bmatrix} 0.5&0.1&-0.3\\ 0.2&-0.4&0.6\\ -0.1&0.5&0.3\\ 0.3&0.2&-0.2 \end{bmatrix} \) \( W^{V}_{2}=\begin{bmatrix} 0.6&-0.2&0.5\\ 0.1&0.4&-0.1\\ 0.3&0.2&0.1\\ -0.5&0.3&0.4 \end{bmatrix} \)

Computing the Q, K, and V matrices

For each head we multiply:

\( Q = E \cdot W^{Q},\quad K = E \cdot W^{K},\quad V = E \cdot W^{V} \).

Compute for the first row \( E_{0} = [0.10,\, 1.30,\, 0.50,\, 1.70] \):

Elements

• \( Q_{1}[0,0] \):

\( 0.10\cdot0.2 + 1.30\cdot0.7 + 0.50\cdot(-0.1) + 1.70\cdot0.3 = 0.02 + 0.91 - 0.05 + 0.51 = 1.39 \).

• \( Q_{1}[0,1] \):

\( 0.10\cdot(-0.3) + 1.30\cdot0.2 + 0.50\cdot0.4 + 1.70\cdot0.1 = -0.03 + 0.26 + 0.20 + 0.17 = 0.60 \).

• \( Q_{1}[0,2] \):

\( 0.10\cdot0.5 + 1.30\cdot(-0.6) + 0.50\cdot0.3 + 1.70\cdot0.2 = 0.05 - 0.78 + 0.15 + 0.34 = -0.24 \).

Compute for the second row \( E_{1} = [1.04,\, 0.94,\, 0.61,\, 1.80] \):

• \( Q_{1}[1,0] \):

\( 1.04\cdot0.2 + 0.94\cdot0.7 + 0.61\cdot(-0.1) + 1.80\cdot0.3 = 0.208 + 0.658 - 0.061 + 0.540 = 1.345 \).

• \( Q_{1}[1,1] \):

\( 1.04\cdot(-0.3) + 0.94\cdot0.2 + 0.61\cdot0.4 + 1.80\cdot0.1 = -0.312 + 0.188 + 0.244 + 0.180 = 0.300 \).

• \( Q_{1}[1,2] \):

\( 1.04\cdot0.5 + 0.94\cdot(-0.6) + 0.61\cdot0.3 + 1.80\cdot0.2 = 0.520 - 0.564 + 0.183 + 0.360 = 0.499 \).

Thus:

\( Q_{1}=\begin{bmatrix} 1.390&0.600&-0.240\\ 1.345&0.300&0.499 \end{bmatrix} \)

Final results

• Head 1:

\( Q_{1}=\begin{bmatrix} 1.390&0.600&-0.240\\ 1.345&0.300&0.499 \end{bmatrix} \), \( K_{1}=\begin{bmatrix} 0.260&0.640&1.320\\ 0.565&0.376&1.205 \end{bmatrix} \), \( V_{1}=\begin{bmatrix} 0.800&-0.150&1.370\\ 1.318&0.229&1.266 \end{bmatrix} \)

• Head 2:

\( Q_{2}=\begin{bmatrix} 0.740&0.710&0.850\\ 0.498&1.372&0.751 \end{bmatrix} \), \( K_{2}=\begin{bmatrix} 0.770&0.080&0.560\\ 1.187&0.393&0.075 \end{bmatrix} \), \( V_{2}=\begin{bmatrix} -0.510&1.110&0.650\\ 0.001&0.830&1.207 \end{bmatrix} \)

We now apply \( \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V \) to each head.

Self-Attention — Head 2

1. Query and Key matrices:

\( Q_{2}=\begin{bmatrix} 0.740&0.710&0.850\\ 0.498&1.372&0.751 \end{bmatrix} \),

\( K_{2}=\begin{bmatrix} 0.770&0.080&0.560\\ 1.187&0.393&0.075 \end{bmatrix} \)

2. Dot products (before scaling):

\( Q_{2} \cdot K_{2}^{\top}=\begin{bmatrix} 1.1026&1.2212\\ 0.9138&1.1866 \end{bmatrix} \)

3. Scaling (divide by \( \sqrt{d_{k}}=\sqrt{3}\approx 1.732 \)):

\( \dfrac{Q_{2}\cdot K_{2}^{\top}}{\sqrt{3}}=\begin{bmatrix} 0.637&0.705\\ 0.528&0.685 \end{bmatrix} \)

4. Apply softmax:

Formula: \( \operatorname{softmax}(x_{i})=\dfrac{e^{x_{i}}}{\sum_{j}e^{x_{j}}} \).

Row-wise (per query): \( \boldsymbol{\alpha}_{2}=\begin{bmatrix} 0.483&0.517\\ 0.461&0.539 \end{bmatrix} \).

5. Value matrix:

\( V_{2}=\begin{bmatrix} -0.510&1.110&0.650\\ 0.001&0.830&1.207 \end{bmatrix} \)

6. Attention computation (weighted sum):

Formula: \( Z_{2}=\boldsymbol{\alpha}_{2}\cdot V_{2} \)

Using \( \boldsymbol{\alpha}_{2}=\begin{bmatrix} 0.483&0.517\\ 0.461&0.539 \end{bmatrix} \) (from Step 4), the result is

\( Z_{2}=\begin{bmatrix} -0.246&0.965&0.938\\ -0.234&0.959&0.950 \end{bmatrix} \).

Self-Attention — Head 1

1. Query and Key matrices:

\( Q_{1}=\begin{bmatrix} 1.390&0.600&-0.240\\ 1.345&0.300&0.499 \end{bmatrix} \),

\( K_{1}=\begin{bmatrix} 0.260&0.640&1.320\\ 0.565&0.376&1.205 \end{bmatrix} \)

2. Dot products (before scaling):

\( Q_{1}\cdot K_{1}^{\top}=\begin{bmatrix} 0.4286&0.7218\\ 1.2004&1.4740 \end{bmatrix} \)


3. Scaling (divide by \( \sqrt{d_{k}}=\sqrt{3}\approx 1.732 \)):

\( \dfrac{Q_{1}\cdot K_{1}^{\top}}{\sqrt{3}}=\begin{bmatrix} 0.247&0.417\\ 0.693&0.851 \end{bmatrix} \)


4. Apply softmax:

Formula: \( \operatorname{softmax}(x_{i})=\dfrac{e^{x_{i}}}{\sum_{j}e^{x_{j}}} \).

Row-wise (per query): \( \boldsymbol{\alpha}_{1}=\begin{bmatrix} 0.458&0.542\\ 0.461&0.539 \end{bmatrix} \).


5. Value matrix:

\( V_{1}=\begin{bmatrix} 0.800&-0.150&1.370\\ 1.318&0.229&1.266 \end{bmatrix} \)


6. Attention computation (weighted sum):

Formula: \( Z_{1}=\boldsymbol{\alpha}_{1}\cdot V_{1} \)

Using \( \boldsymbol{\alpha}_{1}=\begin{bmatrix} 0.458&0.542\\ 0.461&0.539 \end{bmatrix} \) (from Step 4), the result is

\( Z_{1}=\begin{bmatrix} 1.081&0.055&1.314\\ 1.079&0.054&1.314 \end{bmatrix} \).
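
The whole Head 1 computation can be reproduced in NumPy from \( E \) and the Head 1 weights (Head 2 is computed the same way with its own weights):

# Head 1 self-attention on the encoder input E, reproducing Q1, alpha and Z1
import numpy as np

E = np.array([[0.10, 1.30, 0.50, 1.70],
              [1.04, 0.94, 0.61, 1.80]])
W_Q1 = np.array([[ 0.2, -0.3,  0.5],
                 [ 0.7,  0.2, -0.6],
                 [-0.1,  0.4,  0.3],
                 [ 0.3,  0.1,  0.2]])
W_K1 = np.array([[ 0.4, -0.2,  0.1],
                 [ 0.1,  0.3,  0.6],
                 [-0.5,  0.2, -0.3],
                 [ 0.2,  0.1,  0.4]])
W_V1 = np.array([[ 0.3,  0.5,  0.1],
                 [-0.4,  0.2,  0.6],
                 [ 0.2,  0.1, -0.2],
                 [ 0.7, -0.3,  0.4]])

Q1, K1, V1 = E @ W_Q1, E @ W_K1, E @ W_V1
scores = Q1 @ K1.T / np.sqrt(3)                                  # scaled dot products
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
Z1 = alpha @ V1
print(np.round(Q1, 3))    # [[ 1.39   0.6   -0.24 ], [ 1.345  0.3    0.499]]
print(np.round(alpha, 3)) # [[0.458 0.542], [0.461 0.539]]
print(np.round(Z1, 3))    # [[1.081 0.055 1.314], [1.079 0.054 1.314]]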


 


 

Combining the two heads into the encoder MHA output

1. Attention outputs from the two heads

Head 1 (Z₁):

\( Z_{1}=\begin{bmatrix} 1.081&0.055&1.314\\ 1.079&0.054&1.314 \end{bmatrix} \)

Head 2 (Z₂):

\( Z_{2}=\begin{bmatrix} -0.246&0.965&0.938\\ -0.234&0.959&0.950 \end{bmatrix} \)

2. Concatenation along the feature dimension

We concatenate \( Z_{1} \) and \( Z_{2} \) along the width:

\( \operatorname{Concat}(Z_{1},Z_{2})=\begin{bmatrix} 1.081&0.055&1.314&-0.246&0.965&0.938\\ 1.079&0.054&1.314&-0.234&0.959&0.950 \end{bmatrix} \ (2\times6) \)

3. Multiplication by the trainable matrix \( W^{O} \)

Dimensions:

• Input: \( 2\times6 \)

• \( W^{O}\in\mathbb{R}^{6\times4} \)

• Result: \( 2\times4 \) — same as the embedding dimension

Assume (for example) the weight matrix:

\( W^{O}=\begin{bmatrix} 0.2&0.1&0.3&0.4\\ 0.3&0.5&0.2&0.6\\ 0.4&0.3&0.5&0.1\\ 0.1&0.6&0.4&0.2\\ 0.5&0.2&0.3&0.7\\ 0.6&0.4&0.1&0.8 \end{bmatrix} \)

4. Final attention layer output

Formula: \( \mathrm{MHA\_Output}=\operatorname{Concat}(Z_{1},Z_{2})\cdot W^{O} \)

Result:

\( \mathrm{MHA\_Output}=\begin{bmatrix} 1.779&0.951&1.277&1.974\\ 1.784&0.961&1.281&1.981 \end{bmatrix} \)

Feed-forward transformation of the MHA output (the feed-forward sublayer is discussed in more detail below)

1. Input: result of Multi-Head Attention

\( X=\begin{bmatrix} 1.779&0.951&1.277&1.974\\ 1.784&0.961&1.281&1.981 \end{bmatrix} \) ( \(2\times4\) )

2. First linear layer \( \mathbb{R}^{4\times8} \) + bias

Weight matrix \( W_{1} \):

\( W_{1}=\begin{bmatrix} 0.2&0.1&0.3&0.4&0.5&0.6&0.7&0.8\\ 0.3&0.5&0.2&0.6&0.4&0.2&0.1&0.3\\ 0.4&0.3&0.5&0.1&0.6&0.7&0.2&0.2\\ 0.1&0.6&0.4&0.2&0.3&0.3&0.6&0.5 \end{bmatrix} \)

Bias vector \( \mathbf{b}_{1}=[\,0.1,\,0.2,\,0.3,\,0.1,\,0.0,\,-0.1,\,-0.2,\,-0.3\,] \)

\( X\cdot W_{1}+\mathbf{b}_{1}=\begin{bmatrix} 1.449&2.421&2.452&1.905&2.628&2.644&2.580&2.651\\ 1.455&2.431&2.460&1.914&2.639&2.653&2.589&2.662 \end{bmatrix} \)

3. ReLU activation

Formula: \( \operatorname{ReLU}(x)=\max(0,x) \)

All entries are positive, so ReLU does not change the result:

\( \operatorname{ReLU}(X\cdot W_{1}+\mathbf{b}_{1})=\begin{bmatrix} 1.449&2.421&2.452&1.905&2.628&2.644&2.580&2.651\\ 1.455&2.431&2.460&1.914&2.639&2.653&2.589&2.662 \end{bmatrix} \)

4. Second linear layer \( \mathbb{R}^{8\times4} \) + bias

Weight matrix \( W_{2} \):

\( W_{2}=\begin{bmatrix} 0.1&0.4&0.7&1.0\\ 0.3&0.6&0.2&0.9\\ 0.5&0.1&0.9&0.3\\ 0.7&0.2&0.3&0.8\\ 0.9&0.8&0.1&0.6\\ 0.2&0.5&0.4&0.7\\ 0.6&0.9&0.8&0.2\\ 0.4&0.3&0.5&0.1 \end{bmatrix} \)

Bias vector \( \mathbf{b}_{2}=[\,0.2,\,-0.1,\,0.3,\,0.0\,] \)

\( \mathrm{ReLU\ Output}\cdot W_{2}+\mathbf{b}_{2}=\begin{bmatrix} 9.133&9.100&9.287&10.096\\ 9.169&9.136&9.321&10.137 \end{bmatrix} \)

Final (shape \(2\times4\)): \( \mathrm{FFN\_Output}=\begin{bmatrix} 9.133&9.100&9.287&10.096\\ 9.169&9.136&9.321&10.137 \end{bmatrix} \)
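
The same feed-forward computation in NumPy, with the weights given above:

# Verifying the feed-forward computation on the MHA output
import numpy as np

X = np.array([[1.779, 0.951, 1.277, 1.974],
              [1.784, 0.961, 1.281, 1.981]])
W1 = np.array([[0.2, 0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
               [0.3, 0.5, 0.2, 0.6, 0.4, 0.2, 0.1, 0.3],
               [0.4, 0.3, 0.5, 0.1, 0.6, 0.7, 0.2, 0.2],
               [0.1, 0.6, 0.4, 0.2, 0.3, 0.3, 0.6, 0.5]])
b1 = np.array([0.1, 0.2, 0.3, 0.1, 0.0, -0.1, -0.2, -0.3])
W2 = np.array([[0.1, 0.4, 0.7, 1.0],
               [0.3, 0.6, 0.2, 0.9],
               [0.5, 0.1, 0.9, 0.3],
               [0.7, 0.2, 0.3, 0.8],
               [0.9, 0.8, 0.1, 0.6],
               [0.2, 0.5, 0.4, 0.7],
               [0.6, 0.9, 0.8, 0.2],
               [0.4, 0.3, 0.5, 0.1]])
b2 = np.array([0.2, -0.1, 0.3, 0.0])

hidden = np.maximum(0, X @ W1 + b1)   # ReLU(X W1 + b1), shape (2, 8)
ffn_out = hidden @ W2 + b2            # shape (2, 4)
print(np.round(ffn_out, 3))           # matches FFN_Output above up to rounding in the third decimal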



Feed-Forward Layer (FFN)

After the self-attention layer, the encoder applies a feed-forward neural network (FFN). It is a simple two-layer MLP with a ReLU activation in between. The goal of the FFN is to transform the representation produced by attention. The flow is typically:

1. First linear layer: usually expands the hidden size. For example, if the input hidden size is 4, the intermediate (expanded) size might be 8. This expansion allows the model to learn richer nonlinear functions. In our small example with size 4, we expand to 8.

2. ReLU activation: a nonlinear activation that returns 0 for negative inputs and the input itself for positive inputs. This enables the model to learn nonlinear behaviors.

3. Second linear layer: projects back down to the original hidden size. In our example it reduces the size from 8 back to 4.

ReLU equation: \( \operatorname{ReLU}(x)=\max(0,x) \). Overall FFN mapping: \( \mathrm{FFN}(x)=\operatorname{ReLU}(xW_{1}+b_{1})\,W_{2}+b_{2} \).



Problem of exploding values. During training (backpropagation), gradients can become too large and “explode”. Without normalization, small changes in early layers can get amplified in deeper layers. Two common fixes are residual connections and layer normalization.

Residual connections simply add the layer input to its output (e.g., add the original embedding to the attention output). If the gradient becomes too small, this skip path helps it flow:

\( \mathrm{Residual}(x)=x+\mathrm{Layer}(x) \)

Layer normalization normalizes along the embedding dimension so the layer’s inputs have mean 0 and standard deviation 1. This improves gradient flow:

\( \mathrm{LayerNorm}(x)=\frac{x-\mu}{\sqrt{\sigma^{2}+\varepsilon}}\times \gamma+\beta \)

Parameters:

- \( \mu \) — mean of the embedding.

- \( \sigma \) — standard deviation of the embedding.

- \( \varepsilon \) — small constant to avoid division by zero.

- \( \gamma \) and \( \beta \) — learnable scale and shift.

Unlike batch normalization, layer normalization works over the embedding dimension for each sample independently (other samples in the batch don’t affect it). The idea is to normalize each token’s features to mean 0 and std 1.

Why do we add learnable \( \gamma \) and \( \beta \)? If we only normalize, we might reduce the representational power of the layer. With learnable scale and shift, the model can re-scale and re-center the normalized values if that helps.

Combining these equations, we obtain the encoder:

\( Z(x)=\mathrm{LayerNorm}(x+\mathrm{Attention}(x)) \)

\( \mathrm{FFN}(x)=\operatorname{ReLU}(xW_{1}+b_{1})\,W_{2}+b_{2} \)

\( \mathrm{Encoder}(x)=\mathrm{LayerNorm}(Z(x)+\mathrm{FFN}(Z(x))) \)
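
Putting these equations together as a minimal NumPy sketch of one encoder block (single attention head, random stand-in weights; dropout and the learnable \( \gamma, \beta \) of LayerNorm are omitted):

# One encoder block: LayerNorm(x + Attention(x)) followed by LayerNorm(z + FFN(z))
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def encoder_block(x, W_Q, W_K, W_V, W1, b1, W2, b2):
    z = layer_norm(x + self_attention(x, W_Q, W_K, W_V))   # residual + norm
    ffn = np.maximum(0, z @ W1 + b1) @ W2 + b2              # two-layer FFN with ReLU
    return layer_norm(z + ffn)                              # residual + norm

rng = np.random.default_rng(0)
d_model, d_ff, N = 4, 8, 2
x = rng.normal(size=(N, d_model))
params = dict(
    W_Q=rng.normal(size=(d_model, d_model)), W_K=rng.normal(size=(d_model, d_model)),
    W_V=rng.normal(size=(d_model, d_model)),
    W1=rng.normal(size=(d_model, d_ff)), b1=np.zeros(d_ff),
    W2=rng.normal(size=(d_ff, d_model)), b2=np.zeros(d_model),
)
print(encoder_block(x, **params).shape)  # (2, 4)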



1. Input: result of Multi-Head Attention (also denoted x)

\( x=\begin{bmatrix} 1.779&0.951&1.277&1.974\\ 1.784&0.961&1.281&1.981 \end{bmatrix} \) ( \(2\times4\) )



2. Residual connection and attention normalization

Formula: \( Z(x)=\mathrm{LayerNorm}\!\big(x+\mathrm{Attention}(x)\big) \)

For this simplified example we take \( \mathrm{Attention}(x)=x \) (the attention sublayer passes its input straight through), so \( Z(x)=\mathrm{LayerNorm}(x+x)=\mathrm{LayerNorm}(2x) \).

\( 2x=\begin{bmatrix} 3.558&1.901&2.554&3.948\\ 3.568&1.921&2.561&3.961 \end{bmatrix} \)



Layer Normalization:

• For each row, compute the mean and standard deviation.

• Apply: \( \mathrm{LayerNorm}(v)=\dfrac{v-\mu}{\sqrt{\sigma^{2}+\varepsilon}} \)



Final:

\( Z(x)=\begin{bmatrix} 0.702&-1.347&-0.539&1.184\\ 0.701&-1.341&-0.548&1.188 \end{bmatrix} \)
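
This LayerNorm step can be checked directly in NumPy:

# Row-wise LayerNorm of 2x, reproducing Z(x)
import numpy as np

two_x = np.array([[3.558, 1.901, 2.554, 3.948],
                  [3.568, 1.921, 2.561, 3.961]])
mu = two_x.mean(axis=1, keepdims=True)
sigma = two_x.std(axis=1, keepdims=True)
print(np.round((two_x - mu) / (sigma + 1e-6), 3))
# [[ 0.702 -1.347 -0.539  1.184]
#  [ 0.701 -1.341 -0.548  1.188]]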



3. Feed-Forward Layer

Formula: \( \mathrm{FFN}(x)=\operatorname{ReLU}(xW_{1}+b_{1})\,W_{2}+b_{2} \)



3.1 First linear layer \( W_{1}\in\mathbb{R}^{4\times8},\ b_{1}\in\mathbb{R}^{8} \)

\( Z(x)\cdot W_{1}+b_{1}=\begin{bmatrix} -0.2609&0.1454&0.4452&-0.2444&-0.1560&0.0296&0.7594&0.3418\\ -0.2624&0.1480&0.4434&-0.2415&-0.1582&0.0253&0.7596&0.3427 \end{bmatrix} \)



3.2 ReLU

\( \operatorname{ReLU}(Z(x)W_{1}+b_{1})=\max(0,\cdot)=\begin{bmatrix} 0.0000&0.1454&0.4452&0.0000&0.0000&0.0296&0.7594&0.3418\\ 0.0000&0.1480&0.4434&0.0000&0.0000&0.0253&0.7596&0.3427 \end{bmatrix} \)



3.3 Second linear layer \( W_{2}\in\mathbb{R}^{8\times4},\ b_{2}\in\mathbb{R}^{4} \)

\( \mathrm{FFN}(Z(x))=\operatorname{ReLU}(\cdot)\cdot W_{2}+b_{2}=\begin{bmatrix} 1.0645&0.8326&1.5201&0.4712\\ 1.0640&0.8322&1.5177&0.4701 \end{bmatrix} \)



4. Second residual connection and LayerNorm

Formula:

\( \mathrm{Encoder}(x)=\mathrm{LayerNorm}\!\big(Z(x)+\mathrm{FFN}(Z(x))\big) \)

\( Z(x)+\mathrm{FFN}(Z(x))=\begin{bmatrix} 1.7667&-0.5143&0.9808&1.6552\\ 1.7646&-0.5088&0.9702&1.6579 \end{bmatrix} \)



Apply LayerNorm row-wise to obtain the final encoder layer output:

\( \mathrm{Encoder}(x)=\begin{bmatrix} 0.8738&-1.6346&0.0096&0.7513\\ 0.8749&-1.6313&-0.0009&0.7573 \end{bmatrix} \)


Decoder

Most of what we covered for the encoder also applies to the decoder. The decoder contains two attention sublayers — masked self-attention (over the decoder's own partial output) and cross-attention over the encoder outputs — plus a feed-forward layer. Let's go through the pieces.

The decoder block takes two inputs:

• The encoder outputs — a representation (context) of the input sequence.

• The generated output sequence so far.

At inference time, the generated output starts with a special start-of-sequence token SOS. During training, the target output is the ground-truth sequence shifted by one position (teacher forcing).

Given the encoder’s embedding (context) and the token SOS, the decoder generates the next token of the sequence (autoregressive behavior: it uses previously generated tokens to produce the next one):

• Iteration 1: input — SOS, output — “hola”

• Iteration 2: input — SOS + “hola”, output — “mundo”

• Iteration 3: input — SOS + “hola” + “mundo”, output — EOS

Here SOS is the start-of-sequence token, and EOS is the end-of-sequence token. After generating EOS, the decoder stops. It produces one token at a time. Note that in every iteration the same encoder-produced context is used.

Like the encoder, the decoder is a stack of decoder blocks. A decoder block is slightly more complex than an encoder block. Its overall structure is as follows (a code sketch follows the list):

1. Masked self-attention layer

2. Residual connection & layer normalization

3. Encoder–decoder attention (cross-attention) layer

4. Residual connection & layer normalization

5. Feed-forward layer

6. Residual connection & layer normalization
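
A minimal NumPy sketch of this decoder block structure (single-head attention and random stand-in weights; in the running example the causal mask is trivial because only one decoder position is processed):

# One decoder block: masked self-attention -> cross-attention -> FFN, each with Add & Norm
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(Q, K, V, causal=False):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if causal:  # forbid attending to future positions
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    return softmax(scores) @ V

def decoder_block(y, enc_out, p):
    # 1-2. masked self-attention over the decoder's own sequence + Add & Norm
    sa = attention(y @ p["Wq_s"], y @ p["Wk_s"], y @ p["Wv_s"], causal=True)
    y = layer_norm(y + sa)
    # 3-4. cross-attention: Q from the decoder, K and V from the encoder output + Add & Norm
    ca = attention(y @ p["Wq_c"], enc_out @ p["Wk_c"], enc_out @ p["Wv_c"])
    y = layer_norm(y + ca)
    # 5-6. feed-forward + Add & Norm
    ffn = np.maximum(0, y @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return layer_norm(y + ffn)

rng = np.random.default_rng(0)
d, d_ff = 4, 8
p = {k: rng.normal(size=(d, d)) for k in ["Wq_s", "Wk_s", "Wv_s", "Wq_c", "Wk_c", "Wv_c"]}
p.update(W1=rng.normal(size=(d, d_ff)), b1=np.zeros(d_ff),
         W2=rng.normal(size=(d_ff, d)), b2=np.zeros(d))
enc_out = rng.normal(size=(2, d))   # two encoder positions ("hello", "world")
y = rng.normal(size=(1, d))         # one decoder position (SOS)
print(decoder_block(y, enc_out, p).shape)  # (1, 4)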


1. Text embedding

The decoder first embeds the input tokens. The first input token is the start-of-sequence token SOS, so we embed it using the same embedding size as in the encoder. Assume the embedding vector for SOS is:

\( \mathrm{Emb}_{\mathrm{SOS}}=[0.5,\,-0.4,\,1.2,\,0.8] \)

2. Positional encoding

Now we add positional encoding to the embedding, same as for the encoder.

Embedding dimension: \( d=4 \)

Position: \( \mathrm{pos}=0 \)

Formulas:

\( \mathrm{PE}(\mathrm{pos},2i)=\sin\!\left(\dfrac{\mathrm{pos}}{10000^{\,2i/d}}\right) \), \( \mathrm{PE}(\mathrm{pos},2i+1)=\cos\!\left(\dfrac{\mathrm{pos}}{10000^{\,2i/d}}\right) \)

Plugging in \( \mathrm{pos}=0 \):

• \( \mathrm{PE}(0,0)=\sin(0/10000^{0})=\sin(0)=0 \)

• \( \mathrm{PE}(0,1)=\cos(0/10000^{0})=\cos(0)=1 \)

• \( \mathrm{PE}(0,2)=\sin(0/10000^{0.5})=\sin(0)=0 \)

• \( \mathrm{PE}(0,3)=\cos(0/10000^{0.5})=\cos(0)=1 \)

Resulting vector:

\( \mathrm{PE}(0)=[0,\,1,\,0,\,1] \)

3. Adding positional encoding and embedding

We add the positional encoding to the embedding by vector addition.

Sum of embedding and positional encoding:

\( x_{0}=\mathrm{Emb}_{\mathrm{SOS}}+\mathrm{PE}(0)=[0.5,\,-0.4,\,1.2,\,0.8]+[0,\,1,\,0,\,1]=[0.5,\,0.6,\,1.2,\,1.8] \)

4. Attention weight matrices

For each head we define three matrices:

• \( W^{Q}\in\mathbb{R}^{4\times3} \)

• \( W^{K}\in\mathbb{R}^{4\times3} \)

• \( W^{V}\in\mathbb{R}^{4\times3} \)

Head 1:

\( W^{Q}_{1}=\begin{bmatrix} 0.329&-0.073&0.430\\ 0.237&-0.487&0.571\\ 0.313&0.343&-0.446\\ -0.060&-0.155&0.512 \end{bmatrix} \), \( W^{K}_{1}=\begin{bmatrix} 0.334&-0.366&-0.040\\ -0.547&-0.415&0.220\\ 0.294&0.561&-0.209\\ -0.155&-0.037&-0.373 \end{bmatrix} \), \( W^{V}_{1}=\begin{bmatrix} -0.444&-0.029&-0.328\\ 0.204&-0.075&0.399\\ 0.240&-0.225&0.399\\ 0.366&-0.135&-0.254 \end{bmatrix} \)

Head 2:

\( W^{Q}_{2}=\begin{bmatrix} 0.199&-0.383&-0.953\\ 0.240&-0.819&-0.795\\ -0.065&0.779&-0.631\\ 0.114&0.228&-0.471 \end{bmatrix} \), \( W^{K}_{2}=\begin{bmatrix} -0.457&0.538&-0.315\\ 0.346&0.317&0.584\\ -0.097&-0.520&-0.457\\ 0.415&0.293&0.378 \end{bmatrix} \), \( W^{V}_{2}=\begin{bmatrix} -0.439&0.080&0.135\\ 0.271&0.358&0.313\\ -0.205&0.414&-0.591\\ -0.191&0.330&0.422 \end{bmatrix} \)

Computing the Q, K, and V matrices

For each head we compute: \( Q=x_{0}\cdot W^{Q},\quad K=x_{0}\cdot W^{K},\quad V=x_{0}\cdot W^{V} \).

Input vector: \( x_{0}=[0.5,\,0.6,\,1.2,\,1.8] \).

Element-wise demo (Head 1, using \( W^{Q}_{1} \)):

• Element \( Q_{1}[0] \):

\( 0.5\cdot0.329 + 0.6\cdot0.237 + 1.2\cdot0.313 + 1.8\cdot(-0.060) = 0.1645 + 0.1422 + 0.3756 - 0.108 = 0.575 \).

• Element \( Q_{1}[1] \):

\( 0.5\cdot(-0.073) + 0.6\cdot(-0.487) + 1.2\cdot0.343 + 1.8\cdot(-0.155) = -0.0365 - 0.2922 + 0.4116 - 0.279 = -0.196 \).

• Element \( Q_{1}[2] \):

\( 0.5\cdot0.430 + 0.6\cdot0.571 + 1.2\cdot(-0.446) + 1.8\cdot0.512 = 0.215 + 0.3426 - 0.5352 + 0.9216 = 0.944 \).

Final vectors

• Head 1:

\( Q_{1}=[0.575,\,-0.196,\,0.944] \), \( K_{1}=[-0.089,\,0.175,\,-0.810] \), \( V_{1}=[0.847,\,-0.573,\,0.097] \).

• Head 2:

\( Q_{2}=[0.199,\,-0.383,\,-0.953] \), \( K_{2}=[0.240,\,-0.819,\,-0.795] \), \( V_{2}=[-0.065,\,0.779,\,-0.631] \).

Attention (formula)

\( \mathrm{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{Q\cdot K^{\top}}{\sqrt{d_k}}\right)\cdot V \)

Where \( d_k=3 \) and \( \sqrt{3}\approx 1.732 \).

Self-Attention — Head 1

\( Q_{1}=[\,0.575,\,-0.196,\,0.944\,] \), \( K_{1}=[\,-0.089,\,0.175,\,-0.810\,] \), \( V_{1}=[\,0.847,\,-0.573,\,0.097\,] \)

Dot product (before scaling): \( Q_{1}\cdot K_{1}^{\top}=0.575\cdot(-0.089)+(-0.196)\cdot0.175+0.944\cdot(-0.810)=-0.850 \)

Scaling: \( \dfrac{-0.850}{\sqrt{3}}\approx\dfrac{-0.850}{1.732}=-0.491 \)

Softmax (single element → always 1): \( \boldsymbol{\alpha}_{1}=\operatorname{softmax}([\,-0.491\,])=[\,1\,] \)

Weighted value: \( z_{1}=\boldsymbol{\alpha}_{1}\cdot V_{1}=[\,0.847,\,-0.573,\,0.097\,] \)

Self-Attention — Head 2

\( Q_{2}=[\,0.199,\,-0.383,\,-0.953\,] \), \( K_{2}=[\,0.240,\,-0.819,\,-0.795\,] \), \( V_{2}=[\,-0.065,\,0.779,\,-0.631\,] \)

Dot product (before scaling): \( Q_{2}\cdot K_{2}^{\top}=0.199\cdot0.240+(-0.383)\cdot(-0.819)+(-0.953)\cdot(-0.795)=1.121 \)

Scaling: \( \dfrac{1.121}{\sqrt{3}}\approx\dfrac{1.121}{1.732}=0.647 \)

Softmax (single element → always 1): \( \boldsymbol{\alpha}_{2}=\operatorname{softmax}([\,0.647\,])=[\,1\,] \)

Weighted value: \( z_{2}=\boldsymbol{\alpha}_{2}\cdot V_{2}=[\,-0.065,\,0.779,\,-0.631\,] \)

Concatenation along the feature dimension

We concatenate \( Z_{1} \) and \( Z_{2} \) along the width:

\( \operatorname{Concat}(Z_{1},Z_{2})=[0.847,\,-0.573,\,0.097,\,-0.065,\,0.779,\,-0.631] \) ( \(1\times6\) )

Multiplication by the trainable matrix \( W^{O} \)

Dimensions: input \(1\times6\); \( W^{O}\in\mathbb{R}^{6\times4} \); result \(1\times4\).

Weight matrix:

\( W^{O}=\begin{bmatrix} 0.1&0.2&0.3&0.4\\ 0.2&0.3&0.4&0.5\\ 0.3&0.4&0.5&0.6\\ 0.4&0.5&0.6&0.7\\ 0.5&0.6&0.7&0.8\\ 0.6&0.7&0.8&0.9 \end{bmatrix} \)

Final attention-layer output

Formula: \( \mathrm{MHA\_Output}=\operatorname{Concat}(Z_{1},Z_{2})\cdot W^{O} \)

Result: \( \mathrm{MHA\_Output}=[-0.016,\,0.029,\,0.074,\,0.120] \)

Input: Multi-Head Attention output

\( x=[0.5,\,0.6,\,1.2,\,1.8] \)

\( \mathrm{MHA}(x)=[-0.016,\,0.029,\,0.074,\,0.120] \)

Residual Connection and LayerNorm

Residual add: \( z=x+\mathrm{MHA}(x)=[\,0.5-0.016,\ 0.6+0.029,\ 1.2+0.074,\ 1.8+0.120\,]=[\,0.4837,\ 0.6290,\ 1.2743,\ 1.9196\,] \)

LayerNorm formula (per vector element): \( \mathrm{LayerNorm}(z)_i=\dfrac{z_i-\mu}{\sqrt{\sigma^{2}+\varepsilon}} \)

Where: \( \mu=\tfrac{1}{4}\sum z_i=\dfrac{0.4837+0.6290+1.2743+1.9196}{4}=1.0767 \), \( \sigma^{2}=\tfrac{1}{4}\sum(z_i-\mu)^{2}=0.3254 \Rightarrow \sigma\approx0.5704 \)

Normalized output: \( \mathrm{LayerNorm}(z)=\Big[\,\dfrac{0.4837-1.0767}{0.5704},\ \dfrac{0.6290-1.0767}{0.5704},\ \dfrac{1.2743-1.0767}{0.5704},\ \dfrac{1.9196-1.0767}{0.5704}\,\Big]\approx[\, -1.0394,\ -0.7847,\ 0.3465,\ 1.4777\,] \)
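
A quick NumPy check of this residual-plus-LayerNorm step:

# Residual connection + LayerNorm after the decoder's masked self-attention
import numpy as np

z = np.array([0.4837, 0.6290, 1.2743, 1.9196])   # z = x + MHA(x) from above
mu, sigma = z.mean(), z.std()
print(round(float(sigma), 4))                    # 0.5704
print(np.round((z - mu) / sigma, 3))             # ≈ [-1.039 -0.785  0.346  1.478]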


Encoder–Decoder Attention (Cross-Attention)

This part is new! If you were wondering where the embeddings produced by the encoder go, now is exactly the time for them!

In self-attention we compute the queries, keys, and values for the same input embedding. In encoder–decoder attention we compute the queries from the previous decoder layer, and the keys and values from the encoder’s outputs. All computations stay the same as before; the only difference is which embedding we use for the queries.

| Self-attention (encoder) or masked self-attention (decoder) | Cross-Attention |
| --- | --- |
| Q, K, V are taken from the same input | Q — from the decoder; K, V — from the encoder |

The point is that we want the decoder to focus on the relevant parts of the input text (that is, “hello world”). Encoder–decoder attention lets each position in the decoder visit all positions of the input sequence. This is very useful for tasks such as translation, where the decoder needs to concentrate on the relevant parts of the source sequence. The decoder will learn to focus on the relevant parts of the input while generating the correct output tokens. This is a very powerful mechanism!

Attention Weight Matrices

For each head we define three matrices:

• \( W^{Q}\in\mathbb{R}^{4\times3} \)

• \( W^{K}\in\mathbb{R}^{4\times3} \)

• \( W^{V}\in\mathbb{R}^{4\times3} \)


In total:

• \( W_{1}^{Q},\, W_{1}^{K},\, W_{1}^{V} \) — first head

• \( W_{2}^{Q},\, W_{2}^{K},\, W_{2}^{V} \) — second head


Head 1 (Cross-Attention):

\( W^{Q}_{1}=\begin{bmatrix}-0.573&-0.492&0.267\\-0.046&-0.406&0.001\\-0.417&0.236&-0.065\\-0.143&-0.238&0.156\end{bmatrix} \) \( W^{K}_{1}=\begin{bmatrix}-0.166&-0.495&-0.458\\0.554&0.490&0.240\\-0.281&0.563&0.335\\0.260&-0.061&-0.273\end{bmatrix} \) \( W^{V}_{1}=\begin{bmatrix}-0.484&0.483&-0.053\\-0.357&-0.233&0.095\\-0.388&0.428&0.310\\0.263&-0.081&0.153\end{bmatrix} \)


Head 2 (Cross-Attention):

\( W^{Q}_{2}=\begin{bmatrix}0.101&0.180&-0.499\\-0.101&-0.550&-0.007\\-0.204&-0.427&-0.476\\0.105&-0.395&0.510\end{bmatrix} \) \( W^{K}_{2}=\begin{bmatrix}0.097&-0.184&0.109\\-0.573&0.550&-0.021\\0.339&-0.501&-0.016\\-0.011&0.525&0.086\end{bmatrix} \) \( W^{V}_{2}=\begin{bmatrix}-0.032&-0.280&-0.202\\0.025&-0.073&-0.574\\0.392&0.475&-0.432\\0.065&-0.470&0.207\end{bmatrix} \)


Cross-Attention: computing Q, K, V

\( Q=x_{\text{decoder}}\cdot W^{Q},\; K=x_{\text{encoder}}\cdot W^{K},\; V=x_{\text{encoder}}\cdot W^{V} \)


Queries (Q) come from the decoder output after LayerNorm:

\( x_{\text{decoder}}=[-1.0394,\,-0.7847,\,0.3465,\,1.4777] \)


Keys and values (K, V) come from the encoder outputs (two embeddings):

\( x_{\text{encoder}}=\begin{bmatrix}0.8738&-1.6346&0.0096&0.7513\\0.8749&-1.6313&-0.0009&0.7573\end{bmatrix} \)


Head 1:

\( Q_{1}=[0.276,\,0.560,\,-0.070] \) \( K_{1}=\begin{bmatrix}-0.858&-1.274&-0.994\\-0.852&-1.279&-0.999\end{bmatrix} \) \( V_{1}=\begin{bmatrix}0.355&0.746&-0.084\\0.359&0.740&-0.086\end{bmatrix} \)


Head 2:

\( Q_{2}=[0.059,\,-0.487,\,1.113] \) \( K_{2}=\begin{bmatrix}1.016&-0.670&0.195\\1.010&-0.660&0.195\end{bmatrix} \) \( V_{2}=\begin{bmatrix}-0.016&-0.473&0.913\\-0.020&-0.481&0.917\end{bmatrix} \)


Cross-Attention — Head 1

\( Q_{1} = [0.276,\, 0.560,\, -0.070] \) \( K_{1} = \begin{bmatrix} -0.858 & -1.274 & -0.994 \\ -0.852 & -1.279 & -0.999 \end{bmatrix} \) \( V_{1} = \begin{bmatrix} 0.355 & 0.746 & -0.084 \\ 0.359 & 0.740 & -0.086 \end{bmatrix} \)


Dot products (before scaling):

\( Q_{1}\cdot K_{1}^{T} = [-0.8807,\, -0.8815] \)


Scale:

\( \dfrac{Q_{1}K_{1}^{T}}{\sqrt{3}} = [-0.8807,\, -0.8815] \div 1.732 \Rightarrow [-0.5085,\, -0.5089] \)


Softmax:

\( \alpha_{1} = \mathrm{softmax}([-0.5085,\, -0.5089]) = [0.5001,\, 0.4999] \)


Weighted value:

\( z_{1} = \alpha_{1}\cdot V_{1} = 0.5001\cdot[0.355,\, 0.746,\, -0.084] + 0.4999\cdot[0.359,\, 0.740,\, -0.086] = [0.3567,\, 0.7430,\, -0.0850] \)


Cross-Attention — Head 2

\( Q_{2} = [0.059,\, -0.487,\, 1.113] \) \( K_{2} = \begin{bmatrix} 1.016 & -0.670 & 0.195 \\ 1.010 & -0.660 & 0.195 \end{bmatrix} \) \( V_{2} = \begin{bmatrix} -0.016 & -0.473 & 0.913 \\ -0.020 & -0.481 & 0.917 \end{bmatrix} \)


Dot products (before scaling):

\( Q_{2}\cdot K_{2}^{T} = [0.6033,\, 0.5980] \)


Scale:

\( \dfrac{Q_{2}K_{2}^{T}}{\sqrt{3}} = [0.6033,\, 0.5980] \div 1.732 \Rightarrow [0.3483,\, 0.3453] \)


Softmax:

\( \alpha_{2} = \mathrm{softmax}([0.3483,\, 0.3453]) \approx [0.5008,\, 0.4992] \)


Weighted value:

\( z_{2} = \alpha_{2}\cdot V_{2} \approx [-0.0177,\,-0.4770,\,0.9147] \)
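
Both cross-attention heads can be verified in NumPy from the \( Q \), \( K \), \( V \) values above (small differences in the last decimal come from rounding):

# Cross-attention for both heads: softmax weights and weighted values
import numpy as np

def head(Q, K, V):
    scores = Q @ K.T                          # raw dot products
    weights = np.exp(scores / np.sqrt(3))     # scale, then softmax
    weights /= weights.sum()
    return weights, weights @ V

Q1 = np.array([0.276, 0.560, -0.070])
K1 = np.array([[-0.858, -1.274, -0.994], [-0.852, -1.279, -0.999]])
V1 = np.array([[0.355, 0.746, -0.084], [0.359, 0.740, -0.086]])
Q2 = np.array([0.059, -0.487, 1.113])
K2 = np.array([[1.016, -0.670, 0.195], [1.010, -0.660, 0.195]])
V2 = np.array([[-0.016, -0.473, 0.913], [-0.020, -0.481, 0.917]])

a1, z1 = head(Q1, K1, V1)
a2, z2 = head(Q2, K2, V2)
print(np.round(a1, 4), np.round(z1, 4))   # ≈ [0.5001 0.4999] [ 0.357  0.743 -0.085]
print(np.round(a2, 4), np.round(z2, 4))   # ≈ [0.5008 0.4992] [-0.018 -0.477  0.915]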



Outputs from two Cross-Attention heads

Head 1 \( Z_{1} \):

\( Z_{1} = [0.3567,\, 0.7430,\, -0.0850] \)


Head 2 \( Z_{2} \):

\( Z_{2} = [-0.0177,\, -0.4770,\, 0.9147] \)


Concatenation of the two heads

\( \mathrm{Concat}(Z_{1},\, Z_{2}) = [0.3567,\, 0.7430,\, -0.0850,\, -0.0177,\, -0.4770,\, 0.9147]\; (1\times 6) \)


Multiplication by the learnable matrix \( W^{O} \)

\( W^{O}\in\mathbb{R}^{6\times 4},\quad W^{O}=\begin{bmatrix} 0.2 & -0.1 & 0.4 & 0.3 \\ 0.3 & 0.2 & 0.5 & 0.1 \\ 0.1 & 0.3 & 0.2 & 0.4 \\ -0.2 & 0.5 & 0.1 & 0.2 \\ 0.4 & 0.1 & 0.6 & 0.3 \\ 0.3 & -0.3 & 0.2 & 0.5 \end{bmatrix} \)


Final output of MHA Cross-Attention

In other words:

\( \mathrm{MHA}_{\text{cross}}(x) = \mathrm{Concat}(Z_{1}, Z_{2}) \cdot W_{\text{cross}}^{O} \)


Where:

- \( Z_{1}, Z_{2} \) — outputs of the two Cross-Attention heads

- \( W_{\text{cross}}^{O} \) — learnable matrix used to combine the heads

- The result is a vector of size \( d_{\text{model}} \) (here \( 4 \))


\( \mathrm{CrossAttention}(x) = [0.3729,\,-0.2435,\,0.3922,\,0.4580] \)


Residual Connection and LayerNorm (after Cross-Attention)

Input after previous normalization (decoder output after Masked MHA):

\( x = [-1.0394,\,-0.7847,\,0.3465,\,1.4777] \)


Cross-Attention output:

\( \mathrm{MHA}_{\text{cross}}(x) = [0.3729,\,-0.2435,\,0.3922,\,0.4580] \)


Residual connection:

\( z = x + \mathrm{MHA}_{\text{cross}}(x) = [-0.6665,\,-1.0283,\,0.7386,\,1.9357] \)


Layer Normalization:

Formula:

\( \mathrm{LayerNorm}(z) = \dfrac{z_i - \mu}{\sqrt{\sigma^{2} + \varepsilon}} \)


Where:

\( \mu = 0.2444, \quad \sigma = 1.1783 \)

\( \mathrm{LayerNorm}(z) = [-0.7734,\,-1.0804,\,0.4190,\,1.4349] \)


Feed-Forward Network — two linear layers with ReLU in between:

\( \mathrm{FFN}(x) = \mathrm{Linear}_{2}\!\big(\mathrm{ReLU}(x \cdot W_{1} + b_{1})\big) \cdot W_{2} + b_{2} \)


Weight matrices (initialized randomly, fixed for the example):

Linear layer 1:

\( W_{1}\in\mathbb{R}^{4\times 8} \) (expands vector \(4 \to 8\))

\( W_{1}=\begin{bmatrix} 0.40&-0.24&-0.35&-0.07&-0.55&-0.33&0.55&0.36\\ 0.27&-0.47&-0.29&0.47&-0.22&0.46&0.36&0.33\\ 0.42&-0.25&0.55&0.43&-0.06&-0.35&0.56&-0.39\\ 0.27&0.26&-0.31&-0.04&0.12&0.28&0.41&0.28 \end{bmatrix}, \quad b_{1}=[-0.18,\,-0.17,\,-0.55,\,0.03,\,0.37,\,0.31,\,0.26,\,-0.31] \)


Intermediate computation:

\( x \cdot W_{1} + b_{1} = [-0.503,\, 0.556,\,-0.010,\,-0.603,\,-0.160,\,0.680,\,0.234,\,1.563] \)


Apply ReLU (zero out negative values):

\( h = \mathrm{ReLU}(x \cdot W_{1} + b_{1}) = [0,\, 0.556,\, 0,\, 0,\, 0,\, 0.680,\, 0.234,\, 1.563] \)


Linear layer 2:

\( W_{2}\in\mathbb{R}^{8\times 4} \) (compresses \(8\to 4\)),

\( W_{2}=\begin{bmatrix} 0.08&-0.33&-0.26&0.23\\ 0.05&-0.12&0.32&0.39\\ 0.01&0.18&0.01&0.30\\ 0.25&0.34&0.56&0.55\\ 0.43&-0.50&0.04&-0.28\\ -0.21&-0.22&0.27&-0.56\\ -0.52&0.56&0.43&-0.05\\ -0.26&0.58&-0.25&-0.10 \end{bmatrix},\quad b_{2}=[-0.01,\;0.04,\;-0.02,\;0.06] \)


FFN result:

\( \mathrm{FFN}(x)=h\cdot W_{2}+b_{2}=[-0.269,\;1.001,\;-0.603,\;0.660] \)


Residual Connection:

\( z=x+\mathrm{FFN}(x)=[-1.043,\;-0.080,\;-0.184,\;2.095] \)


Layer Normalization (per vector \(z\)):

Step 1 — mean:

\( \mu=\frac{1}{4}(-1.043-0.080-0.184+2.095)=\frac{0.790}{4}=0.1976 \)


Step 2 — variance and standard deviation:

\( \sigma^{2}=\frac{1}{4}\big[(-1.043-0.1976)^{2}+(-0.080-0.1976)^{2}+(-0.184-0.1976)^{2}+(2.095-0.1976)^{2}\big]=\frac{5.368}{4}=1.342,\quad \sigma=\sqrt{1.342}\approx 1.1585 \)


Step 3 — normalize each element:

\( \mathrm{LayerNorm}(z_{i})=\frac{z_{i}-\mu}{\sigma}\Rightarrow \mathrm{LayerNorm}(z)=\left[\frac{-1.043-0.1976}{1.1585},\;\frac{-0.080-0.1976}{1.1585},\;\frac{-0.184-0.1976}{1.1585},\;\frac{2.095-0.1976}{1.1585}\right]=[-1.071,\;-0.239,\;-0.329,\;1.639] \)


Generating the Output Sequence

We already have everything we need to produce the model’s output sequence:

  • Encoder — receives the input sequence and builds a rich, contextual representation of it (a stack of encoder blocks).

  • Decoder — receives the encoder’s outputs plus the tokens generated so far and produces the output sequence (a stack of decoder blocks).

To turn the decoder’s outputs into an actual word, we place a final linear layer and a softmax on top of the decoder. The overall procedure is:

  1. Encode the input.
    The encoder processes the input sequence and produces a contextual representation using the encoder stack.

  2. Initialize the decoder.
    Decoding starts with the embedding of the SOS (Start-of-Sequence) token together with the encoder outputs.

  3. Run the decoder.
    The decoder uses the encoder outputs and the embeddings of all previously generated tokens to produce a new list of embeddings.

  4. Linear layer → logits.
    Apply a linear layer to the last decoder embedding to generate logits (raw scores) for the next token.

  5. Softmax → probabilities.
    Pass the logits through softmax to obtain a probability distribution over the possible next tokens.

  6. Iterative token generation.
    Repeat the process: at each step the decoder generates the next token based on the accumulated generated tokens and the encoder outputs.

  7. Form the sentence.
    Continue until the EOS (End-of-Sequence) token is produced or a predefined maximum length is reached.


Linear Layer

A linear layer is a simple linear transformation. It takes the decoder’s output and maps it to a vector of size vocab_size (the vocabulary size).
For example, with a vocabulary of 10,000 words, the linear layer maps the decoder output to a length-10,000 vector. Each position corresponds to a word in the vocabulary and holds its logit (which becomes a probability after softmax) for being the next word in the sequence. For tutorials or demos, you can start with a small vocabulary (e.g., 10 words).
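
A schematic NumPy version of this final projection; the weights are random stand-ins and the small vocabulary is the one used later in this example:

# Final projection: decoder embedding -> vocabulary logits -> probabilities
import numpy as np

vocab = ["hello", "hola", "EOS", "SOS", "mundo", "world", "a", "?", "the", "el"]
rng = np.random.default_rng(0)

x = rng.normal(size=(1, 4))                    # last decoder embedding (d_model = 4)
W_vocab = rng.normal(size=(4, len(vocab)))     # projection to vocab_size logits
b = np.zeros(len(vocab))

logits = x @ W_vocab + b
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary
next_token = vocab[int(np.argmax(probs))]      # greedy choice of the next token
print(next_token)                              # which token wins depends on the random weights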

Linear layer for logits

\( \text{logits}=x\,W_{\text{vocab}}+b \)



Dimensions:

\( x\in\mathbb{R}^{1\times4} \)

\( W_{\text{vocab}}\in\mathbb{R}^{4\times10} \)

\( b\in\mathbb{R}^{10} \)



Result: \( \text{logits}\in\mathbb{R}^{1\times10} \)

\( W_{\text{vocab}}=\begin{bmatrix}-0.125&-0.458&0.474&0.229&-0.032&0.408&-0.243&0.038&0.036&0.177\\-0.186&-0.190&-0.147&-0.449&0.064&0.072&-0.189&0.084&-0.440&0.487\\0.061&0.112&0.076&-0.073&-0.278&0.487&0.020&0.171&-0.230&-0.187\\-0.228&0.294&0.125&-0.076&0.262&0.303&0.273&0.269&0.278&-0.062\end{bmatrix} \)



Bias:

\( b=[-0.042,\,0.039,\,-0.032,\,0.082,\,-0.053,\,0.009,\,-0.014,\,-0.010,\,0.013,\,0.076] \)

Logit computation (linear projection)



Formula:

\( \text{logits}_i=\sum_{j=1}^{4}x_j\,W_{j,i}+b_i \)



Example (token 0):

\( \text{logit}_0=(-1.071)(-0.125)+(-0.239)(-0.186)+(-0.329)(0.061)+(1.639)(-0.228)+(-0.042)=0.041 \)



Result: logits

\( \text{logits}=[0.041,\,0.844,\,-0.819,\,0.668,\,0.373,\,-0.792,\,-0.911,\,-0.134,\,0.023,\,0.168] \)



Apply softmax:

\( \text{softmax}(z_i)=\dfrac{e^{z_i}}{\sum_j e^{z_j}} \)



\( \text{probs}=[0.107,\,0.240,\,0.045,\,0.053,\,0.150,\,0.047,\,0.041,\,0.090,\,0.105,\,0.122] \)


Conclusion:

With the vocabulary vocab = [hello, hola, EOS, SOS, mundo, world, a, ?, the, el], the maximum probability ( \( \approx 24\% \) ) falls on token 1, so the next token predicted by the decoder is "hola".

What is used as the input to the linear layer? The decoder outputs one embedding for each token in the sequence. The input to the linear layer is the last generated embedding.

This last embedding contains information about the entire sequence up to that step, which means every decoder output embedding carries information about the whole sequence so far.


Iterative token generation in a transformer (example)

1. Start

Generation begins with the token "SOS".

In our case, this is the only input to the decoder at the first step.

2. First step

The decoder receives ["SOS"] and, using the encoder outputs (["hello", "world"]), generates the first token.

The result becomes the token "hola".

3. Second step

The new decoder input is ["SOS", "hola"].

The decoder again consults the encoder and, taking "hola" into account, generates the next token.

For example, the result becomes "mundo".

4. Continue

The next input is ["SOS", "hola", "mundo"].

Generation continues in the same way until the "EOS" token is generated.


Stopping generation

1. Token "EOS"

As soon as the decoder predicts "EOS", generation stops.

This is the signal that the sentence is finished.

2. Maximum length

If the decoder keeps generating but "EOS" does not appear, the process stops automatically

when the maximum length is reached (e.g., 20 tokens).
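
The iterative procedure and both stopping criteria fit into a short greedy-decoding loop; decoder_step below is a hypothetical placeholder for the full decoder + linear + softmax pass, scripted here to mimic the running example:

# Greedy autoregressive generation with EOS / max-length stopping
def generate(encoder_output, decoder_step, sos="SOS", eos="EOS", max_len=20):
    tokens = [sos]
    while len(tokens) < max_len:
        next_token = decoder_step(encoder_output, tokens)  # predict from context + prefix
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop SOS

# toy stand-in that mimics the "hello world" -> "hola mundo" example
def toy_decoder_step(encoder_output, tokens):
    script = {1: "hola", 2: "mundo", 3: "EOS"}
    return script[len(tokens)]

print(generate(["hello", "world"], toy_decoder_step))  # ['hola', 'mundo']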



LLM Evolution

| Model | Year | Architecture | Parameters | Training data | Tokens (B) | Tokens/Param |
| --- | --- | --- | --- | --- | --- | --- |
| GPT | 2018 | Transformer | 117M | about 1B words | 1.0 | 8.55 |
| GPT-2 | 2019 | Transformer | 1.5B | 40 GB of text | 2.0 | 1.33 |
| GPT-3 | 2020 | Transformer | 175B | 300B tokens | 300.0 | 1.71 |
| PaLM | 2022 | Transformer | 540B | 780B tokens | 780.0 | 1.44 |
| Gopher | 2021 | Transformer | 280B | 300B tokens | 300.0 | 1.07 |
| Jurassic-1 | 2021 | Transformer | 178B | 300B tokens | 300.0 | 1.69 |
| LaMDA | 2022 | Transformer | 137B | 168B tokens | 168.0 | 1.23 |
| Megatron-Turing NLG | 2022 | Transformer | 530B | 270B tokens | 270.0 | 0.51 |
| Chinchilla | 2022 | Transformer | 70B | 1.4T tokens | 1400.0 | 20.0 |
| LLaMA | 2023 | Transformer | 65B | 1.4T tokens | 1400.0 | 21.54 |
| LLaMA 2 | 2023 | Transformer | 70B | 2T tokens | 2000.0 | 28.57 |


Vision Transformer (ViT) — an architecture that adapts the transformer mechanism, originally developed for text, to computer vision. ViT splits an image into patches (small blocks) and represents them as a sequence, analogous to tokens in text.

Main stages of a Vision Transformer

1. Splitting the image into patches

An image \(I\) of size \(H \times W \times C\) is divided into patches of size \(P \times P\). Each patch is flattened into a 1-D feature vector.

• Number of patches: \(N = \dfrac{H \times W}{P \times P}\).

• Dimensionality of each patch: \(P \times P \times C\).

The patches are concatenated into a sequence, analogous to NLP tokens.

2. Generating linear embeddings

Each patch \(x_i\) is mapped to a fixed-dimensional embedding \(D\) by a linear layer:

\(z_0^i = E(x_i)\).

Where:

• \(E\) — the linear layer,

• \(z_0\) — the sequence of patch embeddings.

To encode patch order, positional embeddings \(p_i\) are added:

\(z_0 = [\,z_0^1 + p_1,\; z_0^2 + p_2,\; \ldots,\; z_0^N + p_N\,]\).
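
A small NumPy illustration of steps 1–2; the image size, patch size, and embedding dimension are illustrative, and \(E\) and \(p_i\) are random stand-ins for learned parameters:

# Split an image into P x P patches and project each patch to a D-dimensional embedding
import numpy as np

H, W, C, P, D = 32, 32, 3, 16, 8
rng = np.random.default_rng(0)
image = rng.random((H, W, C))

N = (H * W) // (P * P)                                   # number of patches
patches = image.reshape(H // P, P, W // P, P, C)         # cut into a grid of patches
patches = patches.transpose(0, 2, 1, 3, 4).reshape(N, P * P * C)

E = rng.normal(size=(P * P * C, D))                      # linear projection ("E" above)
pos = rng.normal(size=(N, D))                            # positional embeddings p_i
z0 = patches @ E + pos                                   # sequence of patch tokens
print(patches.shape, z0.shape)                           # (4, 768) (4, 8)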

3. Multi-Head Self-Attention (MHA)

After forming the sequence of patch embeddings, they pass through transformer blocks that include:

3.1. Multi-Head Self-Attention (MHA):

• Computes relationships between patches.

• Helps capture global dependencies.

Attention formula: \( \mathrm{Attention}(Q,K,V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \).

Where:

• \(Q, K, V\) are Query, Key, and Value obtained by linear projections of the sequence.

• \(d_k\) is the Key dimensionality.

3.2. Feed-Forward Network (FFN):

• Applies two fully connected layers with a nonlinearity between them.

3.3. Residual Connections:

• Improve stability and mitigate gradient vanishing.

3.4. Layer Normalization:

• Normalizes each layer’s output to accelerate training.

4. Classification token \([ \mathrm{CLS} ]\):

A special \([ \mathrm{CLS} ]\) token is added to the sequence of patches. Its embedding at the end of the transformer is used for classification:

\( z_{L}^{[\mathrm{CLS}]} \rightarrow \text{FC Layer} \rightarrow \text{Softmax} \).

5. Final fully connected layer:

The output vector \( z_{L}^{[\mathrm{CLS}]} \) after the last transformer block goes through a fully connected layer with \(C\) outputs (the number of classes).

Differences Between Attention Mechanisms

| Mechanism | Main objective | Example usage |
| --- | --- | --- |
| Channel Attention | Account for channel importance | SE block, CBAM |
| Spatial Attention | Account for spatial-region importance | CBAM |
| Self-Attention | Capture global dependencies | Transformer, ViT |
| Multi-Head Attention | Extend self-attention with multiple heads | ViT, NLP |

Modern Architectures

Vision Transformer (ViT): a modern architecture for image processing

The Vision Transformer (ViT) adapts the transformer mechanism—originally designed for text—to computer vision. It was introduced in 2020 in the paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”.

Core idea of the Vision Transformer

ViT processes an image as a sequence of patches (small image blocks), analogous to token processing in NLP, without using convolutional layers.

1. Image → Patches

• The input image is split into equal-sized patches (e.g., \(16 \times 16\)).

• Each patch is projected to a fixed-dimensional feature vector via a linear layer.

2. Sequence of tokens

• Patches are treated as a sequence of tokens (analogous to words in NLP).

• Positional embeddings are added to preserve information about patch order.

3. Transformer

• The token sequence is passed through several Transformer layers with Self-Attention and a Feed-Forward Network (FFN).

4. Classification token \([CLS]\)

• A special token \([CLS]\) is prepended to the sequence. Its final embedding after the Transformer is used for prediction.

Vision Transformer Architecture

1. Splitting an image into patches

• An image \(I\) of size \(H \times W \times C\) (height, width, channels) is split into non-overlapping patches of size \(P \times P\).

• Number of patches: \(N = \frac{H \cdot W}{P^{2}}\).

• Each patch is flattened into a 1-D vector of length \(P \cdot P \cdot C\).

2. Linear projection of patches

• A linear layer maps each patch \(x_i\) to a fixed-dimensional embedding \(D\):

\(z_{0}^{\,i} = E(x_i)\).

Where: \(x_i\) — the \(i\)-th patch; \(E\) — the linear projection.

3. Adding positional embeddings

• To retain patch order, a positional embedding \(p_i\) is added to each patch embedding:

\(z_{0} = [\,z_{0}^{\,1} + p_{1},\; z_{0}^{\,2} + p_{2},\; \ldots,\; z_{0}^{\,N} + p_{N}\,]\).

4. Multi-layer Transformer

Each Transformer layer has two key parts:

4.1 Multi-Head Self-Attention (MHA)

• Captures global dependencies among patches.

• Attention operation: \(\mathrm{Attention}(Q,K,V) = \mathrm{Softmax}\!\big((QK^{\top})/\sqrt{d_k}\big)\,V\).

Here \(Q,K,V\) are Query, Key, Value obtained by linear projections of the sequence, and \(d_k\) is the Key dimension.

4.2 Feed-Forward Network (FFN)

• Two fully connected layers with a nonlinearity in between:

\(\mathrm{FFN}(x) = \mathrm{ReLU}(xW_{1} + b_{1})\,W_{2} + b_{2}\).

4.3 Residual connections and normalization

• Each part is wrapped with a residual add and LayerNorm:

\(y = \mathrm{LayerNorm}(x + \mathrm{MHA}(x))\),

\(z = \mathrm{LayerNorm}(y + \mathrm{FFN}(y))\).

5. Classification token (CLS)

• A special token \([CLS]\) is prepended to the sequence to aggregate information over all patches. Its final embedding after the Transformer is used for classification.

# Vision Transformer (ViT) — compact Keras implementation
# --------------------------------------------------------
# - Splits an image into non-overlapping patches
# - Projects patches to embeddings + adds positional embeddings
# - Stacks Transformer encoder blocks (MHA + FFN with residuals & LayerNorm)
# - Mean-pools tokens and applies a softmax classifier head

import tensorflow as tf
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense, Dropout, LayerNormalization, Embedding


# -----------------------------
# Multi-Head Self-Attention
# -----------------------------
class MultiHeadSelfAttention(tf.keras.layers.Layer):
    def __init__(self, embed_dim: int, num_heads: int, **kwargs):
        super().__init__(**kwargs)
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.projection_dim = embed_dim // num_heads

        self.query_dense = Dense(embed_dim)
        self.key_dense   = Dense(embed_dim)
        self.value_dense = Dense(embed_dim)
        self.out_dense   = Dense(embed_dim)

    def _separate_heads(self, x):
        # x: [B, N, D] -> [B, h, N, d]
        B = tf.shape(x)[0]
        N = tf.shape(x)[1]
        x = tf.reshape(x, [B, N, self.num_heads, self.projection_dim])
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def _scaled_dot_product(self, q, k, v):
        # q,k,v: [B, h, N, d]
        scale = tf.cast(tf.shape(k)[-1], tf.float32)
        logits = tf.matmul(q, k, transpose_b=True) / tf.sqrt(scale)
        weights = tf.nn.softmax(logits, axis=-1)
        return tf.matmul(weights, v)  # [B, h, N, d]

    def call(self, inputs):
        # inputs: [B, N, D]
        q = self.query_dense(inputs)
        k = self.key_dense(inputs)
        v = self.value_dense(inputs)

        q = self._separate_heads(q)
        k = self._separate_heads(k)
        v = self._separate_heads(v)

        attn = self._scaled_dot_product(q, k, v)              # [B, h, N, d]
        attn = tf.transpose(attn, perm=[0, 2, 1, 3])          # [B, N, h, d]
        B, N = tf.shape(attn)[0], tf.shape(attn)[1]
        attn = tf.reshape(attn, [B, N, self.embed_dim])       # [B, N, D]
        return self.out_dense(attn)


# -----------------------------
# Transformer Encoder Block
# -----------------------------
class TransformerEncoder(tf.keras.layers.Layer):
    def __init__(self, embed_dim: int, num_heads: int, ff_dim: int, rate: float = 0.1, **kwargs):
        super().__init__(**kwargs)
        self.mha = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim),
        ])

        self.norm1 = LayerNormalization(epsilon=1e-6)
        self.norm2 = LayerNormalization(epsilon=1e-6)
        self.drop1 = Dropout(rate)
        self.drop2 = Dropout(rate)

    def call(self, x, training=False):
        # MHA + residual
        attn = self.mha(x)
        attn = self.drop1(attn, training=training)
        x = self.norm1(x + attn)
        # FFN + residual
        ffn = self.ffn(x)
        ffn = self.drop2(ffn, training=training)
        return self.norm2(x + ffn)


# -----------------------------
# Vision Transformer model
# -----------------------------
def vision_transformer(
    input_shape=(224, 224, 3),
    num_classes=10,
    patch_size=16,
    embed_dim=64,
    num_heads=8,
    num_layers=12,
    ff_dim=128,
    dropout=0.1,
):
    H, W, C = input_shape
    assert H % patch_size == 0 and W % patch_size == 0, "Image size must be divisible by patch_size"
    n_patches = (H // patch_size) * (W // patch_size)

    inputs = Input(shape=input_shape)

    # 1) Split image into patches
    patches = tf.image.extract_patches(
        images=inputs,
        sizes=[1, patch_size, patch_size, 1],
        strides=[1, patch_size, patch_size, 1],
        rates=[1, 1, 1, 1],
        padding="VALID",
    )  # [B, H/P, W/P, P*P*C]
    patches = tf.reshape(patches, [-1, n_patches, patch_size * patch_size * C])  # [B, N, P^2*C]

    # 2) Linear projection to token embeddings
    tokens = Dense(embed_dim)(patches)  # [B, N, D]

    # 3) Positional embeddings (learned)
    pos_embed = Embedding(input_dim=n_patches, output_dim=embed_dim)
    positions = tf.range(start=0, limit=n_patches, delta=1)
    tokens = tokens + pos_embed(positions)  # broadcast add

    x = Dropout(dropout)(tokens)

    # 4) Stack Transformer encoders
    for _ in range(num_layers):
        x = TransformerEncoder(embed_dim, num_heads, ff_dim, rate=dropout)(x)

    # 5) Classification head (mean-pool tokens)
    x = tf.reduce_mean(x, axis=1)  # [B, D]
    x = Dropout(dropout)(x)
    outputs = Dense(num_classes, activation="softmax")(x)

    return Model(inputs, outputs, name="vit")


# -----------------------------
# Example usage
# -----------------------------
if __name__ == "__main__":
    model = vision_transformer(
        input_shape=(224, 224, 3),
        num_classes=10,
        patch_size=16,
        embed_dim=64,
        num_heads=8,
        num_layers=12,
        ff_dim=128,
        dropout=0.1,
    )
    model.summary()
