Attention (as Discrete-Time Markov) Chains

*Denotes equal contribution
¹Tel Aviv University  ²MPI for Informatics

Abstract

We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores, such as selection, summation, and averaging, in a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, as opposed to previous studies that only model immediate effects. Our main observation is that tokens corresponding to semantically similar regions form a set of metastable states, where the attention clusters, while noisy attention scores tend to disperse. Metastable states and their prevalence can be computed efficiently through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank: the steady-state vector of the Markov chain, which measures global token importance. We demonstrate that using it brings improvements in unconditional image generation. We believe our framework offers a fresh view of how tokens are attended to in modern visual transformers.

In A Nutshell

Left: an attention matrix A for a sequence of length 5. Right: a DTMC (discrete-time Markov chain) whose transition probabilities are given by A; only strong connections are shown. Our framework analyzes the attention matrix from a Markov chain perspective.
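
For intuition, here is a minimal sketch (in NumPy, with a random 5×5 matrix standing in for real attention logits) of reading attention as a DTMC: softmax makes every row of the attention matrix sum to one, so the matrix is row-stochastic, and one step of the chain is a single vector-matrix product.

    import numpy as np

    # Hypothetical attention logits for a sequence of length 5.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(5, 5))

    # Row-wise softmax yields a row-stochastic matrix: each row is a
    # probability distribution over the tokens being attended to.
    A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    assert np.allclose(A.sum(axis=1), 1.0)

    # A distribution over tokens (here: all mass on token 0) evolves by
    # one Markov step per multiplication with A.
    v = np.zeros(5)
    v[0] = 1.0
    one_step = v @ A          # direct (immediate) attention of token 0
    two_steps = one_step @ A  # indirect attention, propagated through the chain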


Applications

We show that our framework improves various downstream tasks, such as zero-shot segmentation and unconditional image generation. In this example, we "bounce" from the token representing the class name in ImageNet. Column Select is the common approach of selecting the entire column of the attention matrix corresponding to that token and reshaping it into an image; the sketch below contrasts the two.
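
A minimal illustration of the difference, assuming `A` is a row-stochastic attention matrix averaged over heads and `k` is the index of the class-name token (both hypothetical names; the paper's exact propagation operator may differ):

    import numpy as np

    def column_select(A, k):
        # Common baseline: read incoming attention to token k directly.
        return A[:, k]

    def bounce(A, k, n_steps=3):
        # Start a one-hot state at token k and propagate it through the
        # Markov chain for a few steps before reading out token scores.
        v = np.zeros(A.shape[0])
        v[k] = 1.0
        for _ in range(n_steps):
            v = v @ A
        return v

Either output can then be reshaped into the spatial token grid to obtain raw score maps like those shown below.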

Figure: qualitative comparison. Columns: Original Image; Column Select (Raw Scores, Mask); Ours (Raw Scores, Mask); Ground Truth.

TokenRank

TokenRank is the unique steady-state vector of the Markov chain induced by the attention matrix. It offers a global view of both incoming and outgoing attention and can serve as a standard tool for visualizing self-attention. Below, we visualize the propagation along the Markov chain for exemplary images, layers, and heads, starting from three different initial states. TokenRank is reached once the Markov chain has converged.
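
A minimal sketch of computing TokenRank by power iteration, assuming a row-stochastic attention matrix `A`. The damping/teleportation term, borrowed from PageRank, is our assumption here: it makes the chain irreducible so that the steady state is unique.

    import numpy as np

    def token_rank(A, damping=0.95, tol=1e-8, max_iter=1000):
        # Steady-state vector of the attention chain via power iteration.
        n = A.shape[0]
        M = damping * A + (1.0 - damping) / n  # teleporting chain (assumption)
        v = np.full(n, 1.0 / n)                # start from the uniform distribution
        for _ in range(max_iter):
            v_next = v @ M
            if np.abs(v_next - v).sum() < tol:  # converged: v @ M ≈ v
                break
            v = v_next
        return v  # TokenRank: a global importance score per token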


Figure: convergence to TokenRank for three example images (an ILSVRC2012 validation image, a fire engine, and a grand piano). Each row starts the chain from a different initial vector; all three converge to the same TokenRank.

BibTeX


@misc{erel2025attentionasdiscretetimemarkov,
  title={Attention (as Discrete-Time Markov) Chains},
  author={Yotam Erel and Olaf Dünkel and Rishabh Dabral and Vladislav Golyanik and Christian Theobalt and Amit H. Bermano},
  year={2025},
  eprint={2507.17657},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.17657},
}