We introduce a new interpretation of the attention matrix as a discrete-time Markov chain.
Our interpretation places common operations on attention scores, such as selection,
summation, and averaging, in a unified framework.
It further extends them by considering indirect attention, propagated through the Markov chain, as opposed
to previous studies that only model immediate effects.
Our main observation is that tokens corresponding to semantically similar regions form a set of
metastable states, where the attention clusters, while noisy attention scores tend to disperse.
Metastable states and their prevalence can be easily computed through simple matrix multiplication and
eigenanalysis, respectively.
Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation.
Lastly, we define TokenRank, the steady-state vector of the Markov chain, which measures
global token importance.
We demonstrate that using it improves unconditional image generation.
We believe our framework offers a fresh view of how tokens are attended to in modern visual
transformers.
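
As a rough illustration of the Markov-chain reading described above, the following minimal sketch treats a row-softmaxed attention matrix as a transition matrix, propagates it over several steps to obtain indirect attention, and estimates the steady-state vector by power iteration. It is not the paper's implementation: the function names (`attention_to_markov`, `propagate`, `steady_state`), the random 6-token scores, and the choice of power iteration are all assumptions made for illustration.

```python
import numpy as np

def attention_to_markov(scores):
    """Row-wise softmax: each row becomes a probability distribution,
    so the result is a row-stochastic transition matrix."""
    z = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def propagate(P, steps):
    """Multi-step transition matrix P^steps, i.e. attention after `steps` hops."""
    return np.linalg.matrix_power(P, steps)

def steady_state(P, tol=1e-10, max_iter=10_000):
    """Power iteration for the vector pi satisfying pi = pi @ P."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical pre-softmax attention scores for 6 tokens (illustrative only).
rng = np.random.default_rng(0)
scores = rng.normal(size=(6, 6))

P = attention_to_markov(scores)  # one-step (direct) attention
P4 = propagate(P, 4)             # indirect attention after 4 hops
pi = steady_state(P)             # steady-state vector of the chain
```

Because softmax yields strictly positive entries, the chain in this sketch is irreducible and aperiodic, so the power iteration converges to a unique steady-state vector; an eigendecomposition of the transposed matrix would be an alternative way to obtain the same vector.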