
InfSA — Infinite Self-Attention

A spectral, graph-theoretic reformulation of self-attention for Transformers

Paper (arXiv) • Install • Quick Start • LLMs • Vision Models • API • Results



InfSA (Infinite Self-Attention) replaces standard softmax(QK^T/√d) attention with a spectral diffusion mechanism grounded in graph centrality theory. Each attention layer becomes a diffusion step on a content-adaptive token graph, and token importance is determined by eigenvector centrality — the same principle behind PageRank and Katz ranking — rather than by local query-key affinity.


Softmax attention distributes focus across background regions, while InfSA variants produce sharper, object-aligned activations.

Key properties

  • Drop-in compatible — works with any Transformer: ViT, DINO, RT-DETR, GPT, LLaMA, Qwen, BERT, etc.
  • Two variants: pure_infsa (quadratic, highest quality) and linear_infsa (O(N), scales to 332K tokens)
  • Learnable ρ — the spectral decay parameter adapts during training via sigmoid reparameterization
  • One-liner conversion — model = infsa.convert(model, variant="pure_infsa")
  • 13× faster than standard ViT at 1024² resolution (Linear-InfSA)
  • +3.2 pp ImageNet-1K accuracy gain (84.7% vs 81.5%) as a pure architectural improvement
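The learnable ρ property relies on a sigmoid reparameterization that keeps ρ in (0, 1). A minimal sketch of that idea (the class and the parameter name rho_logit are illustrative, not infsa's actual internals):

```python
import torch
import torch.nn as nn

class LearnableRho(nn.Module):
    """Keep rho in (0, 1) by learning an unconstrained logit.

    Illustrative sketch only; `rho_logit` is not the library's
    actual parameter name.
    """
    def __init__(self, rho_init=0.95):
        super().__init__()
        # Inverse sigmoid, so rho starts exactly at rho_init.
        logit = torch.logit(torch.tensor(float(rho_init)))
        self.rho_logit = nn.Parameter(logit)

    @property
    def rho(self):
        # Sigmoid maps any real-valued logit back into (0, 1).
        return torch.sigmoid(self.rho_logit)

rho = LearnableRho(rho_init=0.9).rho
```

Because the logit is unconstrained, gradient descent can move it freely while the effective ρ always remains a valid decay factor.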

Installation

pip install infsa

Or from source:

cd InfSA_release
pip install -e .

Requirements: torch >= 1.13.0. No other dependencies.


Quick Start

One-liner: convert any existing model

import infsa

# Replace ALL attention layers with InfSA — weights are copied automatically
model = infsa.convert(model, variant="pure_infsa")

That's it. Works with torchvision, HuggingFace, timm, or any custom Transformer.


Use with Vision Models

Torchvision ViT

import infsa
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
model = infsa.convert(model, variant="pure_infsa")
# All 12 attention layers now use InfSA. Fine-tune as usual.

DINO / DINOv2

import torch
import infsa

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
model = infsa.convert(model, variant="pure_infsa", rho_init=0.9)
# DINOv2 ViT-B/14 now uses InfSA attention in all 12 layers

HuggingFace ViT

import infsa
from transformers import ViTForImageClassification
from transformers.models.vit.modeling_vit import ViTSelfAttention

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
model = infsa.convert(model, variant="pure_infsa", target_types=[ViTSelfAttention])

RT-DETR (Object Detection)

import infsa
from transformers import RTDetrForObjectDetection

model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")
# Convert encoder attention to InfSA
model = infsa.convert(
    model, variant="pure_infsa",
    include_patterns=[r"encoder\.layers\."],
)

timm ViT / Swin / DeiT

import timm
import infsa

model = timm.create_model('vit_base_patch16_224', pretrained=True)
model = infsa.convert(model, variant="linear_infsa")

Use with LLMs

InfSA is modality-agnostic — the graph diffusion principles apply to language tokens just as they do to image patches.

Qwen3-8B

import infsa
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")
# Convert all attention layers to InfSA
model = infsa.convert(model, variant="pure_infsa", rho_init=0.9)
# Fine-tune with your data — ρ is learnable per layer

GPT-2

import infsa
from transformers import AutoModel

model = AutoModel.from_pretrained("gpt2")
model = infsa.convert(model, variant="pure_infsa")
# All 12 attention layers replaced ✓

LLaMA / Mistral / Gemma

import infsa
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = infsa.convert(model, variant="linear_infsa", rho_init=0.9)

Custom LLM from scratch

import infsa
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim=4096, num_heads=32):
        super().__init__()
        self.attn = infsa.InfSAAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            variant="pure_infsa",
            rho_init=0.9,
            rho_trainable=True,
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.GELU(),
            nn.Linear(embed_dim * 4, embed_dim),
        )

    def forward(self, x):
        h = self.norm1(x)  # normalize once and reuse for Q, K, V
        x = x + self.attn(h, h, h)[0]
        x = x + self.ff(self.norm2(x))
        return x

Functional API (maximum flexibility)

For custom attention implementations where you control Q/K/V directly:

import infsa

# Inside your custom forward method, after Q/K/V projection:
# q, k, v shape: (batch, heads, seq_len, head_dim)
output = infsa.infsa_attention(q, k, v, variant="pure_infsa", rho=0.9)

Selective Conversion

Convert only specific layers, or exclude certain parts of a model:

# Only convert the first 6 layers
model = infsa.convert(model, include_patterns=[r"layer\.[0-5]\."])

# Convert everything except the decoder
model = infsa.convert(model, exclude_patterns=[r"decoder\."])

# Only convert specific module types
from some_model import CustomAttention
model = infsa.convert(model, target_types=[CustomAttention])
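Conceptually, this kind of selection can be implemented by walking model.named_modules() and filtering by regex and module type. The helper below is an illustrative sketch of that mechanism, not infsa's actual implementation:

```python
import re
import torch.nn as nn

def matching_modules(model, include_patterns=None, exclude_patterns=None,
                     target_types=None):
    """Yield (name, module) pairs passing the same kinds of filters
    convert() exposes. Illustrative sketch, not the library's code."""
    for name, module in model.named_modules():
        if target_types and not isinstance(module, tuple(target_types)):
            continue
        if include_patterns and not any(re.search(p, name) for p in include_patterns):
            continue
        if exclude_patterns and any(re.search(p, name) for p in exclude_patterns):
            continue
        yield name, module

# Toy model: two Linear layers ('0' and '2') and a ReLU ('1')
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
selected = [name for name, _ in matching_modules(model, target_types=[nn.Linear])]
print(selected)  # ['0', '2']
```

The real convert() may differ in details, but the include/exclude/target_types semantics shown above match this filtering model.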

InfSA Variants

| Variant | Complexity | Output | Best for |
|---|---|---|---|
| pure_infsa | O(N²·D) | N×N attention matrix | Quality-critical tasks, short-medium sequences |
| linear_infsa | O(N·D) | N×1 importance vector | Long sequences, high resolution, efficiency-critical |

How it works

Standard softmax attention:

$$A = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)$$

Pure InfSA replaces softmax with Frobenius-normalized ReLU, turning each layer into a spectral diffusion step:

$$\hat{A}^{(l)} = \frac{\bigl[\,Q^{(l)} {K^{(l)}}^{\top}\bigr]_+}{\bigl\|\bigl[\,Q^{(l)} {K^{(l)}}^{\top}\bigr]_+\bigr\|_F + \varepsilon}$$
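In code, this is just a ReLU followed by division by the per-head Frobenius norm. A minimal PyTorch sketch (illustrative, not the packaged implementation):

```python
import torch

def pure_infsa_scores_sketch(q, k, eps=1e-6):
    """Frobenius-normalized ReLU scores (illustrative sketch).

    q, k: (batch, heads, seq_len, head_dim)
    Returns non-negative (batch, heads, N, N) scores whose per-head
    Frobenius norm is at most 1, so repeated application stays stable.
    """
    scores = torch.relu(q @ k.transpose(-2, -1))             # [Q K^T]_+
    frob = scores.flatten(-2).norm(dim=-1)[..., None, None]  # per-head ||.||_F
    return scores / (frob + eps)

q, k = torch.randn(2, 4, 8, 16), torch.randn(2, 4, 8, 16)
a = pure_infsa_scores_sketch(q, k)
```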

Across $L$ Transformer layers, the accumulated output implements a truncated Neumann series:

$$S_L = \sum_{t=1}^{L} \gamma^{t}\, \hat{A}^{(t)} \cdots \hat{A}^{(1)}\, X^{(0)} \;\xrightarrow{\,L\to\infty\,}\; \Bigl[\bigl(I - \gamma\hat{A}\bigr)^{-1} - I\Bigr]\, X^{(0)}$$

This is equivalent to computing Katz centrality on the token graph — the same mathematical object underlying PageRank and the fundamental matrix of absorbing Markov chains.
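When every layer shares one operator $\hat{A}$, the truncated series converges to the closed form whenever the spectral radius of $\gamma\hat{A}$ is below 1. A quick numeric check, with a random Frobenius-normalized matrix standing in for $\hat{A}$:

```python
import torch

torch.manual_seed(0)
N, gamma, L = 6, 0.5, 60
A = torch.rand(N, N)
A = A / torch.linalg.norm(A)  # Frobenius norm 1 => spectral radius <= 1

# Truncated Neumann series: sum_{t=1}^{L} (gamma * A)^t
S, term = torch.zeros(N, N), torch.eye(N)
for _ in range(L):
    term = gamma * A @ term
    S = S + term

closed_form = torch.linalg.inv(torch.eye(N) - gamma * A) - torch.eye(N)
print(torch.allclose(S, closed_form, atol=1e-5))  # True
```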

Linear InfSA approximates the principal eigenvector of the implicit attention operator in O(N) time:

$$\bar{q} = \sum_i \alpha_i\, Q_i, \qquad a_j = \frac{\bigl[\bar{q}^\top K_j\bigr]_+}{\sum_l \bigl[\bar{q}^\top K_l\bigr]_+ + \varepsilon}$$
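A matching PyTorch sketch, using uniform pooling weights $\alpha_i = 1/N$ in place of the learned ones (an assumption of this sketch, not the library's code):

```python
import torch

def linear_infsa_scores_sketch(q, k, eps=1e-6):
    """O(N) importance scores (illustrative sketch).

    Pools all queries into one summary vector q_bar (uniform weights
    here; the real alpha_i are an assumption of this sketch), then
    normalizes the ReLU'd key responses into an importance distribution.
    q, k: (batch, heads, seq_len, head_dim) -> (batch, heads, seq_len)
    """
    q_bar = q.mean(dim=-2)                                   # (B, H, D)
    a = torch.relu(torch.einsum("bhd,bhnd->bhn", q_bar, k))  # [q_bar^T K_j]_+
    return a / (a.sum(dim=-1, keepdim=True) + eps)

q, k = torch.randn(2, 4, 32, 16), torch.randn(2, 4, 32, 16)
a = linear_infsa_scores_sketch(q, k)
```

Every token gets a single importance weight, so no N×N matrix is ever materialized.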


Absorbing Markov chain interpretation: InfSA's multi-hop propagation correctly identifies globally important tokens that single-hop softmax attention misses.


API Reference

infsa.convert(model, variant, **kwargs)

Replace all attention layers in any model.

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | nn.Module | — | Model to convert |
| variant | str | "pure_infsa" | "pure_infsa" or "linear_infsa" |
| rho_init | float | 0.95 | Initial spectral decay ρ |
| rho_trainable | bool | True | Make ρ learnable |
| copy_weights | bool | True | Copy existing Q/K/V/O weights |
| include_patterns | list[str] | None | Regex: only convert matching module names |
| exclude_patterns | list[str] | None | Regex: skip matching module names |
| target_types | list[Type] | None | Only convert specific module types |
| inplace | bool | False | Modify the model in place |

infsa.InfSAAttention(embed_dim, num_heads, variant, **kwargs)

Drop-in nn.Module replacement for nn.MultiheadAttention.

| Parameter | Type | Default | Description |
|---|---|---|---|
| embed_dim | int | — | Total embedding dimension |
| num_heads | int | — | Number of attention heads |
| variant | str | "pure_infsa" | "pure_infsa" or "linear_infsa" |
| dropout | float | 0.0 | Dropout on attention scores |
| batch_first | bool | True | Input shape (B, N, E) vs (N, B, E) |
| rho_init | float | 0.95 | Initial ρ value |
| rho_trainable | bool | True | Whether ρ adapts during training |

infsa.infsa_attention(q, k, v, variant, rho)

Functional API on pre-projected tensors (B, H, N, D).

infsa.pure_infsa_scores(q, k, rho) / infsa.linear_infsa_scores(q, k, rho)

Compute raw attention scores without applying to values.


Experimental Results

Classification Accuracy


ImageNet-1K and ImageNet-V2 top-1 accuracy vs. parameter count. InfViT variants (red stars) achieve competitive or superior accuracy. On ImageNet-V2, all InfViT models exceed every baseline (up to 79.8% vs 76.8%).

Key results (4-layer ViT, 53.5M params, DeiT recipe, no distillation):

| Model | IN-1K Top-1 | IN-V2 Top-1 | Params | GFLOPs |
|---|---|---|---|---|
| Standard ViT 4L | 81.5% | — | 57.7M | 17.5 |
| Linear InfViT 4L | 84.7% | 79.8% | 53.5M | 59 |
| Pure InfViT 4L | 85.1% | 79.5% | 57.7M | 59 |
| Pure InfViT 24L | 85.4% | 79.8% | 330.6M | — |

The +3.2 pp gain of Linear InfViT over Standard ViT is purely architectural — same training recipe, no external data.

Efficiency & Scalability


Efficiency dashboard at 1024²: Linear InfViT achieves 13.4× speed-up, 231 img/s throughput, and 0.87 J/img — best in every metric.


Throughput vs. energy per image for nine attention mechanisms, colored by asymptotic complexity. InfViT Linear (O(N), red star) dominates the top-left corner.

| Metric | Standard ViT | Linear InfViT | Improvement |
|---|---|---|---|
| Throughput (1024²) | 17.19 img/s | 230.95 img/s | 13.4× |
| Energy (1024²) | 11.63 J/img | 0.87 J/img | 13.4× |
| Max resolution | 1024² | 9216² (332K tokens) | Only model to complete |
| Inference latency | 58.16 ms | 4.33 ms | 13.4× |


Training and inference latency across resolutions. Linear InfViT scales near-linearly to 9216² while all other models OOM beyond 1024².

More scalability plots


4-layer vs 24-layer depth comparison: throughput and energy.

Attention Quality


Attention quality: MoRF-AOC, ROC-AUC, and PR-AUC (%). InfSA variants outperform Standard ViT by 20–34 pp across all metrics.

| Metric | Standard ViT | Pure InfSA | Linear InfSA |
|---|---|---|---|
| MoRF-AOC ↑ | 42.6% | 71.7% | 76.0% |
| ROC-AUC ↑ | 53.8% | 77.3% | 75.9% |
| PR-AUC ↑ | 56.2% | 72.4% | 76.1% |


Bounding-box localization: InfSA attention maps align precisely with object boundaries.

More attention quality plots


MoRF degradation and LeRF retention curves.


Examples

| Example | Description |
|---|---|
| vit_torchvision.py | Convert a pretrained torchvision ViT-B/16 |
| huggingface_vit.py | Convert a HuggingFace ViT |
| huggingface_llm.py | InfSA in GPT-2 and custom LLMs |
| custom_transformer.py | Build a ViT from scratch with InfSA |

Testing

Unit tests

pytest tests/test_infsa.py -v

Integration tests

End-to-end tests that validate InfSA on real model architectures with synthetic inputs:

| Test | Model | What it validates |
|---|---|---|
| test_torchvision_vit.py | torchvision ViT-B/16 | infsa.convert() on a PyTorch ViT: inference, training, gradient flow |
| test_huggingface_vit.py | HuggingFace ViT | Custom wrapper for HF ViTSelfAttention, weight copying |
| test_detr.py | DETR (ResNet-50) | Monkey-patching DETR's cross/self-attention with the InfSA functional API |
| test_custom_transformer.py | Custom ViT + MiniLLM | Building from scratch with InfSAAttention, nn.TransformerEncoder conversion |
| test_qwen3_llm.py | Qwen3-0.6B | Auto-converting a full LLM, inference on SQuAD/GSM8K |

Run all integration tests:

python tests/integration/run_all.py

Results are written to results/.


Project Structure

InfSA_release/
├── infsa/                  # Core library (pip-installable)
│   ├── __init__.py         # Public API exports
│   ├── core.py             # Functional: pure_infsa_scores, linear_infsa_scores, infsa_attention
│   ├── attention.py        # nn.Module: InfSAAttention (drop-in for nn.MultiheadAttention)
│   └── convert.py          # Auto-conversion: convert(), replace_attention()
├── examples/               # Runnable usage examples
├── tests/
│   ├── test_infsa.py       # Unit tests (pytest)
│   └── integration/        # End-to-end integration tests
│       ├── helpers.py      # Shared test utilities
│       └── run_all.py      # Run all integration tests
├── datasets/               # Small sample datasets for integration tests
│   ├── imnet/              # ImageNet validation XML annotations
│   └── llm/                # LLM benchmark JSONL samples
├── figures/                # Paper figures used in this README
└── pyproject.toml          # Package metadata

Citation

@article{roffo2025infsa,
  title={Infinite Self-Attention},
  author={Roffo, Giorgio},
  year={2025}
}

License

MIT License. See LICENSE.
