Attention Rollout Visualization

This article explores Section D.8 of the ViT paper, where we can find very interesting figures about attention maps. Based on the paper Quantifying Attention Flow in Transformers [1] and the ViT-pytorch repository [2], we will work out how those images were computed.
TL;DR
- Attention weights (self-attention) correspond to information transfer between tokens in the same layer.
- Attention rollout extends this concept to different layers: how much information is transferred from a token in layer $\ell_1$ to another token in layer $\ell_2 > \ell_1$?
- The heatmap you have seen in the ViT paper corresponds to the information transfer from the embedded patch layer to the class token in the final layer: how much is each patch used for class inference?
1. Recalling Self-Attention
Information Exchange in the Same Layer
In transformers, the tokens (patches in ViT) of an input sequence exchange information in the self-attention layer (refer to this colab for an interactive demonstration).
For the $i$-th token $\mathbf{z}_{\ell-1}^i$ coming from layer $\ell-1$ (notation borrowed from the ViT paper), an attention head in layer $\ell$ first extracts a query $\mathbf{q}_\ell^i$, key $\mathbf{k}_\ell^i$, and value $\mathbf{v}_\ell^i$ through linear projections with learnable weights. The $i$-th token is then associated with a vector $\sum_j A_\ell^{ij}\,\mathbf{v}_\ell^j$ in the value space, with attention weights $A_\ell^{ij}$ proportional to $\mathbf{q}_\ell^i \cdot \mathbf{k}_\ell^j$ (I will omit the softmax and scaling for brevity). This attention weight $A_\ell^{ij}$ can be seen as the amount of information exchanged between the $i$-th and $j$-th tokens in a head of attention layer $\ell$.
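Below is a minimal sketch of this computation in PyTorch. It is not taken from [2]; the single head, the tensor shapes, and the random inputs are assumptions for illustration.

```python
import torch

# Hypothetical sizes: 196 patch tokens + 1 class token, embedding dim 768 (ViT-B/16-like).
num_tokens, dim = 197, 768
tokens = torch.randn(num_tokens, dim)        # the z_{l-1}^i vectors entering layer l

# Learnable projections that produce query, key, and value for one head.
W_q = torch.nn.Linear(dim, dim, bias=False)
W_k = torch.nn.Linear(dim, dim, bias=False)
W_v = torch.nn.Linear(dim, dim, bias=False)
q, k, v = W_q(tokens), W_k(tokens), W_v(tokens)

# Attention weights: A[i, j] is proportional to q_i . k_j
# (with the softmax and scaling that the text omits for brevity).
A = torch.softmax(q @ k.T / dim ** 0.5, dim=-1)   # (num_tokens, num_tokens)

# Each output token is a weighted sum of values: the within-layer information exchange.
out = A @ v
```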
Attention Weights Do Not Accumulate Information across Layers
According to the authors of [1], $A_\ell^{ij}$ does not seem to accumulate the information passed along from the input up to layer $\ell$, although the weight does account for the exchange within the same layer.
There are multiple ways to build an intuition for this. First of all, attention weights tend to get smaller in the upper layers (larger $\ell$), as the figure below shows (borrowed from [1]).

The y-axis denotes the layer index and the x-axis the tokens. Color denotes the attention weight $A_\ell^{0j}$, where token 0 is the class token.
For layers 4, 5, and 6, the magnitudes of the attention weights are nearly flat across tokens: they carry no useful information about which tokens matter.
Another, more statistical approach is to investigate the correlation between 1) the information score of a token and 2) its attention weight in the final layer $L$, $A_L^{0i}$:
- information score of a token: the drop in inference performance when the token is blanked out across layers (e.g., the classification performance drop will be large if we blank out the mouth patches for a cats-vs-dogs network).
- $A_L^{0i}$: the information from the $i$-th word or image token into the classification token (token 0) at the final layer $L$. This token is fed into the classification head (MLP) in ViT.
Let us assume that the attention weight $A_L^{0i}$ has well accumulated the information from the $i$-th token in the first layer all the way to the final layer. Then there should be a high correlation between the information score and the magnitude of $A_L^{0i}$. But this is not the case if you look into Table 1 of [1].
In conclusion, although $A_\ell^{ij}$ can capture the information exchange between the $i$-th and $j$-th tokens within the same layer $\ell$, it is not a reliable measure of the information accumulated from the $j$-th input token up to the $i$-th token in layer $\ell$.
2. Attention Rollout
Information across a Single Layer
Then, what is a good metric to measure the accumulated information transfer from a token in layer $\ell_1$ to another token in layer $\ell_2$? (Good means a high correlation with the information score from the previous section.) Paper [1] presents two measures: attention rollout and attention flow. In this article, I will examine only the former.
First of all, let us collect the attention weights at layer $\ell$ into a matrix $A_\ell$ whose $(r, c)$ element is $A_\ell^{rc}$ ($r$ = row, $c$ = col). The authors propose that the information exchange through a single layer, composed of self-attention and the residual connection, can be encoded with $A_\ell + I$, where $I$ is the identity matrix.
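As a small sketch (with a random stand-in attention matrix rather than real ViT weights), the per-layer transfer matrix would look like this:

```python
import torch

num_tokens = 197                                                   # hypothetical token count
A_l = torch.softmax(torch.randn(num_tokens, num_tokens), dim=-1)   # stand-in for layer l's attention

# Self-attention plus the residual connection in one matrix.
transfer_l = A_l + torch.eye(num_tokens)
# ([1] additionally re-normalizes the rows; see the disclaimer below.)
```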
Information across Multiple Layers
What if we proceed with an additional layer $\ell+1$? The $(r, c)$ element of $A_{\ell+1} + I$ represents the information transfer from the $c$-th token in layer $\ell$ to the $r$-th token in layer $\ell+1$. Across the two layers $\ell$ and $\ell+1$, any pair of tokens has multiple paths. For example, in the figure below, the $c$-th token of layer $\ell-1$ can transfer information into the $r$-th token of layer $\ell+1$ along all tokens of layer $\ell$.


Thus, we might associate the $(r, c)$ element of the product $(A_{\ell+1} + I)(A_\ell + I)$ with the information transfer from the $c$-th token of layer $\ell-1$ to the $r$-th token of layer $\ell+1$. The same chaining applies to any two tokens in different layers; in particular, up to the final layer $L$ we obtain $\tilde{A}_L = (A_L + I)(A_{L-1} + I)\cdots(A_1 + I)$. This is called attention rollout. According to [1], this measure shows a high correlation with the information score, as reported in Table 1.
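Here is a minimal sketch of attention rollout under the simplifications used in this article (per-layer matrices already fused over heads, and no row normalization; the dummy inputs and the 12-layer, 197-token shapes are assumptions for illustration):

```python
import torch

def attention_rollout(attentions):
    """Chain (A_l + I) over all layers, bottom to top.

    attentions: list of (num_tokens, num_tokens) attention matrices, one per layer.
    Returns the accumulated transfer matrix from the input tokens to the final layer.
    """
    identity = torch.eye(attentions[0].shape[-1])
    rollout = identity
    for A in attentions:                      # layers 1 ... L
        # Row normalization from [1] is skipped here, as in the text (see disclaimer).
        rollout = (A + identity) @ rollout
    return rollout

# Usage with dummy attention matrices for a 12-layer, 197-token model.
attentions = [torch.softmax(torch.randn(197, 197), dim=-1) for _ in range(12)]
rollout = attention_rollout(attentions)       # (197, 197)
```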
Disclaimer
Kindly note that I skipped the row normalization of the matrices $A_\ell + I$ for clarity of explanation. Also, I did not consider the presence of multiple heads. How the heads are handled is implementation-dependent, as this repo shows. We can take the mean over the heads, or keep only the heads with strong attention.
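For instance, a head-fusion step in the spirit of the options above could look like the following sketch (the function name and the `head_fusion` argument are my own, not from either codebase):

```python
import torch

def collapse_heads(A_heads, head_fusion="mean"):
    """Fuse per-head attention (num_heads, T, T) into one (T, T) matrix,
    add the residual connection, and re-normalize the rows."""
    if head_fusion == "mean":
        A = A_heads.mean(dim=0)              # average over heads
    elif head_fusion == "max":
        A = A_heads.max(dim=0).values        # keep only the strongest attention per pair
    else:
        raise ValueError(head_fusion)
    A = A + torch.eye(A.shape[-1])           # residual connection
    return A / A.sum(dim=-1, keepdim=True)   # rows sum to one again

# Usage: one layer with 12 heads and 197 tokens.
A_heads = torch.softmax(torch.randn(12, 197, 197), dim=-1)
transfer_l = collapse_heads(A_heads, head_fusion="mean")
```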
3. Interpreting Heatmap Visualization
Now, only one step remains until Section D.8 of the ViT paper! Our interest is to measure how much information from each input patch is used for classification. As we discussed, this can be derived from the attention rollout $\tilde{A}_L$. Then, what is the meaning of the 0-th row of $\tilde{A}_L$, the row of the class token? It is the accumulated information flowing into the classification token from each input patch.

As this figure illustrates, the heatmap value of the first patch can be considered a measure of how much of its information was used to infer the class of the image. Of course, a patch is larger than a single pixel, so the patch-level heatmap is very coarse. In general, most implementations interpolate the patch heatmap up to the original image resolution.
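A sketch of this last step, assuming a ViT-B/16-like layout with the class token at index 0 followed by a 14×14 patch grid and a 224×224 input image:

```python
import torch
import torch.nn.functional as F

rollout = torch.rand(197, 197)            # stand-in for the rolled-out matrix

cls_row = rollout[0, 1:]                  # information into the class token from each patch
heatmap = cls_row.reshape(1, 1, 14, 14)   # back to the patch grid
heatmap = F.interpolate(heatmap, size=(224, 224), mode="bilinear", align_corners=False)
heatmap = heatmap.squeeze()               # (224, 224), ready to overlay on the input image
heatmap = heatmap / heatmap.max()         # scale to [0, 1] for visualization
```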
My explanation so far boils down to this notebook written by jeonsworld.
Please leave a comment if you find anything weird or incorrect. I hope this article has helped you understand this frequently seen heatmap in the context of transformers!