Attention (as Discrete-Time Markov) Chains

*Denotes equal contribution
¹Tel Aviv University  ²MPI for Informatics

Abstract

We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores, such as selection, summation, and averaging, in a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, as opposed to previous studies that only model immediate effects. Our main observation is that tokens corresponding to semantically similar regions form a set of metastable states, where the attention clusters, while noisy attention scores tend to disperse. Metastable states and their prevalence can be computed efficiently through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank: the steady-state vector of the Markov chain, which measures global token importance. We demonstrate that using it brings improvements in unconditional image generation. We believe our framework offers a fresh view of how tokens are attended to in modern visual transformers.

In A Nutshell

Left: an attention matrix A for a sequence of length 5. Right: a DTMC (discrete-time Markov chain) whose transition probabilities are given by A; only strong connections are shown. Our framework analyzes the attention matrix from a Markov chain perspective.
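
For intuition, here is a minimal sketch (in NumPy, with a random 5×5 matrix standing in for real attention logits) of reading attention as a DTMC: softmax makes every row of the attention matrix sum to one, so the matrix is row-stochastic, and one step of the chain is a single vector-matrix product.

    import numpy as np

    # Hypothetical attention logits for a sequence of length 5.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(5, 5))

    # Row-wise softmax yields a row-stochastic matrix: each row is a
    # probability distribution over the tokens being attended to.
    A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    assert np.allclose(A.sum(axis=1), 1.0)

    # A distribution over tokens (here: all mass on token 0) evolves by
    # one Markov step per multiplication with A.
    v = np.zeros(5)
    v[0] = 1.0
    one_step = v @ A          # direct (immediate) attention of token 0
    two_steps = one_step @ A  # indirect attention, propagated through the chain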


Applications

We show that our framework improves various downstream tasks, such as zero-shot segmentation and unconditional image generation. In this example, we "bounce" from the token representing the class name in ImageNet. Column Select is the common approach of selecting the entire column of the attention matrix corresponding to that token and reshaping it into an image; the sketch below contrasts the two.
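
A minimal illustration of the difference, assuming `A` is a row-stochastic attention matrix averaged over heads and `k` is the index of the class-name token (both hypothetical names; the paper's exact propagation operator may differ):

    import numpy as np

    def column_select(A, k):
        # Common baseline: read incoming attention to token k directly.
        return A[:, k]

    def bounce(A, k, n_steps=3):
        # Start a one-hot state at token k and propagate it through the
        # Markov chain for a few steps before reading out token scores.
        v = np.zeros(A.shape[0])
        v[k] = 1.0
        for _ in range(n_steps):
            v = v @ A
        return v

Either output can then be reshaped into the spatial token grid to obtain raw score maps like those shown below.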

Figure: qualitative comparison. Columns: Original Image; Column Select (Raw Scores, Mask); Ours (Raw Scores, Mask); Ground Truth.

TokenRank

TokenRank is the unique steady-state vector of the Markov chain induced by the attention matrix. It offers a global view of both incoming and outgoing attention and can serve as a standard tool for visualizing self-attention. Below, we visualize the propagation along the Markov chain for exemplary images, layers, and heads, starting from three different initial states. TokenRank is reached once the Markov chain has converged.
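
A minimal sketch of computing TokenRank by power iteration, assuming a row-stochastic attention matrix `A`. The damping/teleportation term, borrowed from PageRank, is our assumption here: it makes the chain irreducible so that the steady state is unique.

    import numpy as np

    def token_rank(A, damping=0.95, tol=1e-8, max_iter=1000):
        # Steady-state vector of the attention chain via power iteration.
        n = A.shape[0]
        M = damping * A + (1.0 - damping) / n  # teleporting chain (assumption)
        v = np.full(n, 1.0 / n)                # start from the uniform distribution
        for _ in range(max_iter):
            v_next = v @ M
            if np.abs(v_next - v).sum() < tol:  # converged: v @ M ≈ v
                break
            v = v_next
        return v  # TokenRank: a global importance score per token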


Figure: convergence to TokenRank for three example images (an ILSVRC2012 validation image, a fire engine, and a grand piano). Each row starts the chain from a different initial vector; all three converge to the same TokenRank.

BibTeX


@misc{erel2025attentionasdiscretetimemarkov,
  title={Attention (as Discrete-Time Markov) Chains},
  author={Yotam Erel and Olaf Dünkel and Rishabh Dabral and Vladislav Golyanik and Christian Theobalt and Amit H. Bermano},
  year={2025},
  eprint={2507.17657},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.17657},
}