In A Nutshell

Left: Attention matrix A with sequence length 5. Right: A DTMC with transition probabilities defined by matrix A, where only strong connections are shown. Our framework analyzes the attention matrix from a Markov chain perspective.
We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation unifies common operations on attention scores, such as selection, summation, and averaging, within a single framework. It further extends them with indirect attention, propagated through the Markov chain, whereas previous studies model only immediate effects. Our main observation is that tokens corresponding to semantically similar regions form a set of metastable states, where attention clusters, while noisy attention scores tend to disperse. Metastable states and their prevalence can be computed easily through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank: the steady-state vector of the Markov chain, which measures global token importance. We demonstrate that using it improves unconditional image generation. We believe our framework offers a fresh view of how tokens attend to one another in modern visual transformers.
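To make the construction concrete, here is a minimal sketch of the Markov-chain reading of attention: each row of the attention matrix is treated as a distribution over next states (tokens), and indirect attention is obtained by repeated matrix multiplication. The function names and the toy 5-token matrix are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch: an attention matrix as a discrete-time Markov chain (DTMC).
import numpy as np

def attention_to_dtmc(attn: np.ndarray) -> np.ndarray:
    """Row-normalize an attention matrix so each row is a probability
    distribution over next states (softmax attention already satisfies this)."""
    return attn / attn.sum(axis=-1, keepdims=True)

def propagate(P: np.ndarray, steps: int) -> np.ndarray:
    """Indirect (multi-hop) attention: the k-step transition matrix P^k,
    obtained by repeated matrix multiplication."""
    return np.linalg.matrix_power(P, steps)

# Toy example with sequence length 5, as in the figure above.
rng = np.random.default_rng(0)
A = rng.random((5, 5))
P = attention_to_dtmc(A)
P3 = propagate(P, 3)        # attention propagated over three hops
print(P3.sum(axis=-1))      # each row still sums to 1: P^k is a valid DTMC
```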
We show that our framework improves various downstream tasks, such as zero-shot segmentation and unconditional image generation. In this example, we "bounce" from the token representing the class name in ImageNet. "Column Select" is the common approach of selecting the whole column of the attention matrix for that token and reshaping it (both operations are sketched in code after the figure).
[Figure: Original Image | Column Select (Raw Scores, Mask) | Ours (Raw Scores, Mask) | Ground Truth]
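For concreteness, here is a hedged sketch of the two operations compared above. Column Select reads a token's column of incoming attention directly; the multi-hop "bounce" shown here is one plausible reading of the indirect-attention idea, propagating a one-hot state through the chain before reading scores. The hop count, grid shape, and token index are illustrative assumptions.

```python
import numpy as np

def column_select(P: np.ndarray, token_idx: int, grid: tuple) -> np.ndarray:
    """Baseline: take the whole column of the attention matrix for the
    chosen token and reshape it to the patch grid. Assumes the sequence
    holds only patch tokens; in a real ViT, drop the [CLS] row/column first."""
    return P[:, token_idx].reshape(grid)

def bounce(P: np.ndarray, token_idx: int, grid: tuple, hops: int = 3) -> np.ndarray:
    """Illustrative multi-hop variant: propagate the one-hot state of the
    chosen token through the chain for `hops` steps before reading scores."""
    x = np.zeros(P.shape[0])
    x[token_idx] = 1.0
    for _ in range(hops):
        x = x @ P  # one Markov-chain step
    return x.reshape(grid)

# Toy usage: 16 patch tokens arranged on a 4x4 grid.
rng = np.random.default_rng(0)
P = rng.random((16, 16))
P /= P.sum(axis=-1, keepdims=True)       # make rows stochastic
print(column_select(P, token_idx=5, grid=(4, 4)).shape)  # (4, 4)
print(bounce(P, token_idx=5, grid=(4, 4)).shape)         # (4, 4)
```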
TokenRank is the unique steady-state vector of the Markov chain induced by the attention matrix. It can be used for visualization and for global understanding of both incoming and outgoing attention, and can serve as a standard tool for visualizing self-attention. Below, we visualize propagation along the Markov chain for exemplary images, layers, and heads, starting from three initial states. TokenRank is reached once the Markov chain has converged.
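A minimal sketch of computing TokenRank by power iteration, assuming the row-stochastic attention matrix defines an irreducible, aperiodic chain (which dense softmax attention satisfies), so the steady state is unique. The tolerance and iteration cap are illustrative assumptions.

```python
import numpy as np

def token_rank(P: np.ndarray, tol: float = 1e-10, max_iter: int = 1000) -> np.ndarray:
    """Steady-state vector pi with pi @ P = pi, found by power iteration.
    P must be row-stochastic; uniqueness assumes an irreducible,
    aperiodic chain."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = pi @ P          # one Markov-chain step
        if np.abs(nxt - pi).max() < tol:
            break
        pi = nxt
    return pi

# pi @ P == pi up to tolerance, and pi.sum() == 1: any initial
# distribution converges to the same TokenRank vector.
```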
@misc{erel2025attentionasdiscretetimemarkov,
title={Attention (as Discrete-Time Markov) Chains},
author={Yotam Erel and Olaf Dünkel and Rishabh Dabral and Vladislav Golyanik and Christian Theobalt and Amit H. Bermano},
year={2025},
eprint={2507.17657},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.17657},
}