InfSA: Infinite Self-Attention
A spectral, graph-theoretic reformulation of self-attention for Transformers
Paper (arXiv) • Install • Quick Start • LLMs • Vision Models • API • Results
InfSA (Infinite Self-Attention) replaces standard softmax(QK^T/√d) attention with a spectral diffusion mechanism grounded in graph centrality theory. Each attention layer becomes a diffusion step on a content-adaptive token graph, and token importance is determined by eigenvector centrality (the same principle behind PageRank and Katz ranking) rather than by local query-key affinity.
Softmax attention distributes focus across background regions, while InfSA variants produce sharper, object-aligned activations.
Key properties
- Drop-in compatible: works with any Transformer (ViT, DINO, RT-DETR, GPT, LLaMA, Qwen, BERT, etc.)
- Two variants: `pure_infsa` (quadratic, highest quality) and `linear_infsa` (O(N), scales to 332K tokens)
- Learnable ρ: the spectral decay parameter adapts during training via sigmoid reparameterization
- One-liner conversion: `model = infsa.convert(model, variant="pure_infsa")`
- 13× faster than standard ViT at 1024² resolution (Linear InfSA)
- +3.2 pp ImageNet-1K accuracy gain (84.7% vs. 81.5%) as a pure architectural improvement
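The learnable ρ above stays in (0, 1) via sigmoid reparameterization: the optimizer works on an unconstrained logit, and the sigmoid squashes it back. A minimal sketch of how that can work (`LearnableRho` is an illustrative name, not part of the infsa API):

```python
import torch
import torch.nn as nn

class LearnableRho(nn.Module):
    """Sketch of a sigmoid-reparameterized spectral decay parameter rho."""

    def __init__(self, rho_init: float = 0.95):
        super().__init__()
        # Store the logit of rho so gradient descent runs in unconstrained space.
        logit = torch.log(torch.tensor(rho_init) / (1.0 - rho_init))
        self.rho_logit = nn.Parameter(logit)

    def forward(self) -> torch.Tensor:
        # The sigmoid keeps rho strictly inside (0, 1) throughout training.
        return torch.sigmoid(self.rho_logit)

rho = LearnableRho(0.95)
print(float(rho()))  # ~0.95 at initialization
```

Because the constraint is built into the parameterization, no clipping or projection step is needed during training.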
Installation
pip install infsa
Or from source:
cd InfSA_release
pip install -e .
Requirements: torch >= 1.13.0. No other dependencies.
Quick Start
One-liner: convert any existing model
import infsa
# Replace ALL attention layers with InfSA; weights are copied automatically
model = infsa.convert(model, variant="pure_infsa")
That's it. Works with torchvision, HuggingFace, timm, or any custom Transformer.
Use with Vision Models
Torchvision ViT
import infsa
from torchvision.models import vit_b_16, ViT_B_16_Weights
model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
model = infsa.convert(model, variant="pure_infsa")
# All 12 attention layers now use InfSA. Fine-tune as usual.
DINO / DINOv2
import torch
import infsa
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
model = infsa.convert(model, variant="pure_infsa", rho_init=0.9)
# DINOv2 ViT-B/14 now uses InfSA attention in all 12 layers
HuggingFace ViT
import infsa
from transformers import ViTForImageClassification
from transformers.models.vit.modeling_vit import ViTSelfAttention
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
model = infsa.convert(model, variant="pure_infsa", target_types=[ViTSelfAttention])
RT-DETR (Object Detection)
import infsa
from transformers import RTDetrForObjectDetection
model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")
# Convert encoder attention to InfSA
model = infsa.convert(
    model,
    variant="pure_infsa",
    include_patterns=[r"encoder\.layers\."],
)
timm ViT / Swin / DeiT
import timm
import infsa
model = timm.create_model('vit_base_patch16_224', pretrained=True)
model = infsa.convert(model, variant="linear_infsa")
Use with LLMs
InfSA is modality-agnostic: the graph diffusion principles apply to language tokens just as they do to image patches.
Qwen3-8B
import infsa
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")
# Convert all attention layers to InfSA
model = infsa.convert(model, variant="pure_infsa", rho_init=0.9)
# Fine-tune with your data; rho is learnable per layer
GPT-2
import infsa
from transformers import AutoModel
model = AutoModel.from_pretrained("gpt2")
model = infsa.convert(model, variant="pure_infsa")
# All 12 attention layers replaced
LLaMA / Mistral / Gemma
import infsa
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = infsa.convert(model, variant="linear_infsa", rho_init=0.9)
Custom LLM from scratch
import infsa
import torch.nn as nn
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim=4096, num_heads=32):
        super().__init__()
        self.attn = infsa.InfSAAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            variant="pure_infsa",
            rho_init=0.9,
            rho_trainable=True,
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.GELU(),
            nn.Linear(embed_dim * 4, embed_dim),
        )

    def forward(self, x):
        h = self.norm1(x)  # pre-norm: normalize once, reuse for q/k/v
        x = x + self.attn(h, h, h)[0]
        x = x + self.ff(self.norm2(x))
        return x
Functional API (maximum flexibility)
For custom attention implementations where you control Q/K/V directly:
import infsa
# Inside your custom forward method, after Q/K/V projection:
# q, k, v shape: (batch, heads, seq_len, head_dim)
output = infsa.infsa_attention(q, k, v, variant="pure_infsa", rho=0.9)
Selective Conversion
Convert only specific layers, or exclude certain parts of a model:
# Only convert the first 6 layers
model = infsa.convert(model, include_patterns=[r"layer\.[0-5]\."])
# Convert everything except the decoder
model = infsa.convert(model, exclude_patterns=[r"decoder\."])
# Only convert specific module types
from some_model import CustomAttention
model = infsa.convert(model, target_types=[CustomAttention])
InfSA Variants
| Variant | Complexity | Output | Best for |
|---|---|---|---|
| `pure_infsa` | O(N²·D) | N×N attention matrix | Quality-critical tasks, short-to-medium sequences |
| `linear_infsa` | O(N·D) | N×1 importance vector | Long sequences, high resolution, efficiency-critical workloads |
How it works
Standard softmax attention computes $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}(QK^\top/\sqrt{d})\,V$.
Pure InfSA replaces the softmax with a Frobenius-normalized ReLU, $A = \mathrm{ReLU}(QK^\top)/\|\mathrm{ReLU}(QK^\top)\|_F$, turning each layer into a spectral diffusion step.
Across $L$ Transformer layers, the accumulated output implements a truncated Neumann series, $\sum_{l=0}^{L} \rho^l A^l \approx (I - \rho A)^{-1}$.
This is equivalent to computing Katz centrality on the token graph, the same mathematical object underlying PageRank and the fundamental matrix of absorbing Markov chains.
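The Neumann-series claim is easy to verify numerically: for a row-stochastic non-negative matrix and $\rho < 1$, the truncated multi-hop sum converges to the fundamental matrix $(I - \rho A)^{-1}$. A quick torch check on a synthetic matrix (not a real attention map):

```python
import torch

torch.manual_seed(0)
N, rho, L = 6, 0.9, 200

# Synthetic non-negative "attention" matrix; row-stochastic so its
# spectral radius is 1 and the series converges for rho < 1.
A = torch.rand(N, N)
A = A / A.sum(dim=-1, keepdim=True)

# Truncated Neumann series accumulated across L diffusion steps.
S = torch.eye(N)
P = torch.eye(N)
for _ in range(L):
    P = rho * (P @ A)
    S = S + P

# Closed form: the fundamental matrix (I - rho*A)^(-1), as in Katz centrality.
closed = torch.linalg.inv(torch.eye(N) - rho * A)
print(torch.allclose(S, closed, atol=1e-4))  # True
```

The truncation error shrinks geometrically, like $\rho^{L+1}/(1-\rho)$, which is why a modest stack of layers already approximates the infinite-hop limit well.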
Linear InfSA approximates the principal eigenvector of the implicit attention operator in O(N) time.
Absorbing Markov chain interpretation: InfSA's multi-hop propagation correctly identifies globally important tokens that single-hop softmax attention misses.
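A standard way to approximate that principal eigenvector without ever materializing the N×N operator is power iteration on a factored form, where each matvec costs O(N·D). The sketch below uses an ELU+1 feature map for non-negativity; this is an illustrative reconstruction, not the library's actual `linear_infsa` kernel:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D = 512, 64
q = torch.randn(N, D)
k = torch.randn(N, D)

# Non-negative feature map so A = phi(q) @ phi(k).T has positive entries
# (Perron-Frobenius then guarantees a non-negative principal eigenvector).
phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1

# Power iteration: A @ v is computed as phi_q @ (phi_k.T @ v), which costs
# O(N*D) per step -- the N x N matrix is never formed.
v = torch.full((N,), 1.0 / N)
for _ in range(50):
    v = phi_q @ (phi_k.t() @ v)
    v = v / v.norm()

# Reference: principal eigenvector from the dense operator.
A = phi_q @ phi_k.t()
evals, evecs = torch.linalg.eig(A)
top = evecs[:, evals.real.argmax()].real
top = top / top.norm()
print(torch.allclose(v.abs(), top.abs(), atol=1e-3))
```

The resulting N×1 vector is the per-token importance score that Linear InfSA exposes in place of a full attention matrix.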
API Reference
infsa.convert(model, variant, **kwargs)
Replace all attention layers in any model.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `nn.Module` | required | Model to convert |
| `variant` | `str` | `"pure_infsa"` | `"pure_infsa"` or `"linear_infsa"` |
| `rho_init` | `float` | `0.95` | Initial spectral decay ρ |
| `rho_trainable` | `bool` | `True` | Make ρ learnable |
| `copy_weights` | `bool` | `True` | Copy existing Q/K/V/O weights |
| `include_patterns` | `list[str]` | `None` | Regex: only convert matching module names |
| `exclude_patterns` | `list[str]` | `None` | Regex: skip matching module names |
| `target_types` | `list[Type]` | `None` | Only convert specific module types |
| `inplace` | `bool` | `False` | Modify the model in place |
infsa.InfSAAttention(embed_dim, num_heads, variant, **kwargs)
Drop-in nn.Module replacement for nn.MultiheadAttention.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `embed_dim` | `int` | required | Total embedding dimension |
| `num_heads` | `int` | required | Number of attention heads |
| `variant` | `str` | `"pure_infsa"` | `"pure_infsa"` or `"linear_infsa"` |
| `dropout` | `float` | `0.0` | Dropout on attention scores |
| `batch_first` | `bool` | `True` | Input shape `(B, N, E)` vs. `(N, B, E)` |
| `rho_init` | `float` | `0.95` | Initial ρ value |
| `rho_trainable` | `bool` | `True` | Whether ρ adapts during training |
infsa.infsa_attention(q, k, v, variant, rho)
Functional API on pre-projected tensors (B, H, N, D).
infsa.pure_infsa_scores(q, k, rho) / infsa.linear_infsa_scores(q, k, rho)
Compute raw attention scores without applying to values.
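For intuition, here is a standalone reference sketch of what `pure_infsa_scores` plausibly computes, assuming the Frobenius-normalized ReLU described under "How it works" (the shipped implementation may differ in details such as the normalization axis):

```python
import torch

def pure_infsa_scores_ref(q: torch.Tensor, k: torch.Tensor, rho: float = 0.95) -> torch.Tensor:
    """Reference sketch: rho-scaled, Frobenius-normalized ReLU similarity.

    q, k: (B, H, N, D)  ->  scores: (B, H, N, N)
    """
    s = torch.relu(q @ k.transpose(-2, -1))           # non-negative token affinities
    norm = s.flatten(-2).norm(dim=-1, keepdim=True)   # Frobenius norm per (batch, head)
    return rho * s / norm.unsqueeze(-1).clamp_min(1e-8)

q = torch.randn(2, 4, 16, 8)
k = torch.randn(2, 4, 16, 8)
scores = pure_infsa_scores_ref(q, k, rho=0.9)
print(scores.shape)  # torch.Size([2, 4, 16, 16])
```

Unlike softmax scores, these rows are not forced to sum to 1; instead the whole per-head matrix has Frobenius norm ρ, which is what keeps the layer-by-layer diffusion series convergent.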
Experimental Results
Classification Accuracy
ImageNet-1K and ImageNet-V2 top-1 accuracy vs. parameter count. InfViT variants (red stars) achieve competitive or superior accuracy. On ImageNet-V2, all InfViT models exceed every baseline (up to 79.8% vs 76.8%).
Key results (4-layer ViT, 53.5M params, DeiT recipe, no distillation):
| Model | IN-1K Top-1 | IN-V2 Top-1 | Params | GFLOPs |
|---|---|---|---|---|
| Standard ViT 4L | 81.5% | – | 57.7M | 17.5 |
| Linear InfViT 4L | 84.7% | 79.8% | 53.5M | 59 |
| Pure InfViT 4L | 85.1% | 79.5% | 57.7M | 59 |
| Pure InfViT 24L | 85.4% | 79.8% | 330.6M | – |
The +3.2 pp gain of Linear InfViT over Standard ViT is purely architectural: same training recipe, no external data.
Efficiency & Scalability
Efficiency dashboard at 1024²: Linear InfViT achieves a 13.4× speed-up, 231 img/s throughput, and 0.87 J/img, best in every metric.
Throughput vs. energy per image for nine attention mechanisms, colored by asymptotic complexity. InfViT Linear (O(N), red star) dominates the top-left corner.
| Metric | Standard ViT | Linear InfViT | Improvement |
|---|---|---|---|
| Throughput (1024²) | 17.19 img/s | 230.95 img/s | 13.4× |
| Energy (1024²) | 11.63 J/img | 0.87 J/img | 13.4× |
| Max resolution | 1024² | 9216² (332K tokens) | Only model to complete |
| Inference latency | 58.16 ms | 4.33 ms | 13.4× |
Training and inference latency across resolutions. Linear InfViT scales near-linearly to 9216² while all other models OOM beyond 1024².
More scalability plots
4-layer vs 24-layer depth comparison: throughput and energy.
Attention Quality
Attention quality: MoRF-AOC, ROC-AUC, and PR-AUC (%). InfSA variants outperform Standard ViT by 20–34 pp across all metrics.
| Metric | Standard ViT | Pure InfSA | Linear InfSA |
|---|---|---|---|
| MoRF-AOC | 42.6% | 71.7% | 76.0% |
| ROC-AUC | 53.8% | 77.3% | 75.9% |
| PR-AUC | 56.2% | 72.4% | 76.1% |
Bounding-box localization: InfSA attention maps align precisely with object boundaries.
More attention quality plots
MoRF degradation and LeRF retention curves.
Examples
| Example | Description |
|---|---|
| vit_torchvision.py | Convert a pretrained torchvision ViT-B/16 |
| huggingface_vit.py | Convert a HuggingFace ViT |
| huggingface_llm.py | InfSA in GPT-2 and custom LLMs |
| custom_transformer.py | Build a ViT from scratch with InfSA |
Testing
Unit tests
pytest tests/test_infsa.py -v
Integration tests
End-to-end tests that validate InfSA on real model architectures with synthetic inputs:
| Test | Model | What it validates |
|---|---|---|
| `test_torchvision_vit.py` | torchvision ViT-B/16 | `infsa.convert()` on PyTorch ViT, inference + training + gradient flow |
| `test_huggingface_vit.py` | HuggingFace ViT | Custom wrapper for HF `ViTSelfAttention`, weight copying |
| `test_detr.py` | DETR (ResNet-50) | Monkey-patching DETR's cross/self-attention with the InfSA functional API |
| `test_custom_transformer.py` | Custom ViT + MiniLLM | Building from scratch with `InfSAAttention`, `nn.TransformerEncoder` conversion |
| `test_qwen3_llm.py` | Qwen3-0.6B | Auto-converting a full LLM, inference on SQuAD/GSM8K |
Run all integration tests:
python tests/integration/run_all.py
Results are written to results/.
Project Structure
InfSA_release/
├── infsa/                  # Core library (pip-installable)
│   ├── __init__.py         # Public API exports
│   ├── core.py             # Functional: pure_infsa_scores, linear_infsa_scores, infsa_attention
│   ├── attention.py        # nn.Module: InfSAAttention (drop-in for nn.MultiheadAttention)
│   └── convert.py          # Auto-conversion: convert(), replace_attention()
├── examples/               # Runnable usage examples
├── tests/
│   ├── test_infsa.py       # Unit tests (pytest)
│   └── integration/        # End-to-end integration tests
│       ├── helpers.py      # Shared test utilities
│       └── run_all.py      # Run all integration tests
├── datasets/               # Small sample datasets for integration tests
│   ├── imnet/              # ImageNet validation XML annotations
│   └── llm/                # LLM benchmark JSONL samples
├── figures/                # Paper figures used in this README
└── pyproject.toml          # Package metadata
Citation
@article{roffo2025infsa,
title={Infinite Self-Attention},
author={Roffo, Giorgio},
year={2025}
}
License
MIT License. See LICENSE.