title: Attention Visualizer
emoji: 🧠
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
Transformer Attention Visualizer
An interactive visualization tool for exploring how transformer-based language models (like BERT) understand sentences internally using self-attention heatmaps.
Features
- Multi-model support: BERT Base, DistilBERT, GPT-2
- Per-layer, per-head attention heatmaps
- Average all heads mode
- Click-to-pin tokens: see what each token attends to
- Dark glassmorphism UI with smooth animations
- LRU model cache: loads once, reuses across requests
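The "loads once, reuses across requests" behavior of the model cache can be sketched with the standard library's `functools.lru_cache`. This is a minimal illustration, not the backend's actual code: `get_model`, the `load_calls` counter, and the cache size of 3 are all assumptions, and the dictionary stands in for real model weights.

```python
from functools import lru_cache

# Counts how often the (hypothetical) expensive load actually runs.
load_calls = {"count": 0}


@lru_cache(maxsize=3)  # assumed capacity: one slot per supported model
def get_model(model_id: str) -> dict:
    """Stand-in for an expensive model load; a real backend would load weights here."""
    load_calls["count"] += 1
    return {"model_id": model_id}


first = get_model("bert-base-uncased")
second = get_model("bert-base-uncased")  # served from the cache, no second load
```

After both calls, `load_calls["count"]` is still 1 and `first is second` holds, which is the property the feature describes.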
Quick Start
# One-shot launcher (installs deps + starts both servers)
chmod +x start.sh && ./start.sh
Then open http://localhost:5173
API docs available at http://localhost:8000/docs
Manual Setup
Backend
cd backend
pip install -r requirements.txt
uvicorn main:app --reload --port 8000
Frontend
cd frontend
npm install
npm run dev
Architecture
frontend (React + Plotly) -> /api/attend (FastAPI) -> HuggingFace + PyTorch
port 5173                    port 8000
Models
| Model | Layers | Heads | Type | Size |
|---|---|---|---|---|
| bert-base-uncased | 12 | 12 | Encoder | 440MB |
| distilbert-base-uncased | 6 | 12 | Encoder | 265MB |
| gpt2 | 12 | 12 | Decoder | 548MB |
Models are downloaded automatically from HuggingFace on first use and cached locally.
API
GET /api/models -> list of available models
POST /api/attend -> { text, model_id } -> { tokens, attentions, n_layers, n_heads }
GET /api/health -> { status: "ok" }
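A client call to `POST /api/attend` might look like the sketch below, which uses only the standard library. It assumes the backend accepts a JSON body with the `text` and `model_id` fields listed above and that model IDs match the names in the Models table. The helper names `build_attend_request` and `attend` are hypothetical.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # backend port from the Quick Start section


def build_attend_request(text: str, model_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST /api/attend request with a JSON body."""
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/api/attend",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def attend(text: str, model_id: str) -> dict:
    """Send the request and decode the JSON response (needs a running backend)."""
    with urllib.request.urlopen(build_attend_request(text, model_id)) as resp:
        return json.load(resp)


req = build_attend_request("The cat sat on the mat.", "bert-base-uncased")
```

With the backend running, `attend(...)` should return a dictionary with the `tokens`, `attentions`, `n_layers`, and `n_heads` keys described above.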
This project provides a full-stack implementation using:
- FastAPI backend
- Hugging Face Transformers
- PyTorch inference
- React frontend
- Plotly attention visualization
It allows users to inspect attention behavior across tokens, heads, and layers to understand how contextual meaning is built inside transformer architectures.
Project Goal
This tool helps users answer one key question:
How does a transformer model understand language internally?
By visualizing attention matrices in real time, we can observe:
- token relationships
- grammatical structure learning
- semantic reasoning
- sentence-level representation formation
Example Visualization
Example sentence:
The cat sat on the mat and watched the dog.
Tokenized form:
[CLS] the cat sat on the mat and watched the dog . [SEP]
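The tokenized form above can be reproduced for this particular sentence with a toy tokenizer. This is a simplified stand-in, not the WordPiece tokenizer that BERT actually uses: it only lowercases, splits punctuation from words, and adds the special tokens, so it will not split rare words into subword pieces. The function name `toy_tokenize` is hypothetical.

```python
import re


def toy_tokenize(sentence: str) -> list[str]:
    """Lowercase, split words from punctuation, and add [CLS]/[SEP] markers."""
    pieces = re.findall(r"\w+|[^\w\s]", sentence.lower())
    return ["[CLS]", *pieces, "[SEP]"]


tokens = toy_tokenize("The cat sat on the mat and watched the dog.")
```

The result has 13 tokens, so each heatmap for this sentence is a 13 x 13 grid.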
Each heatmap cell represents:
How much one token attends to another token
Rows:
Query token (who is looking)
Columns:
Key token (who is being looked at)
Color intensity represents attention strength.
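Reading one row of the heatmap, which is what pinning a token does, can be sketched as follows. The weights are made up for illustration, and `attended_by` is a hypothetical helper rather than part of the tool's API.

```python
def attended_by(tokens, matrix, query_index):
    """Return (key_token, weight) pairs for one query row, strongest first."""
    row = matrix[query_index]
    return sorted(zip(tokens, row), key=lambda pair: pair[1], reverse=True)


tokens = ["[CLS]", "watched", "the", "dog", "[SEP]"]
# Made-up weights for illustration only; each row sums to 1.
matrix = [
    [0.60, 0.10, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.10, 0.60, 0.10],
    [0.05, 0.05, 0.20, 0.60, 0.10],
    [0.10, 0.30, 0.20, 0.30, 0.10],
    [0.20, 0.20, 0.20, 0.20, 0.20],
]

ranked = attended_by(tokens, matrix, tokens.index("watched"))
```

Here the row for `watched` (the query) ranks `dog` (the key) first, matching the row and column convention described above.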
Transformer Attention Mechanism
Self-attention is computed as:
Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V
Meaning:
- Each token generates a query vector, a key vector, and a value vector
- Each query is compared against every key
- The scaled similarity scores are normalized by softmax into attention weights
- Each token's output representation is the attention-weighted sum of the value vectors
The heatmap visualizes these normalized attention weights.
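The formula above can be written out in plain Python for a single head. This is a teaching sketch with toy numbers, not the PyTorch code the models run. The `weights` it returns correspond to the values a heatmap displays.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return (weights, output) for single-head scaled dot-product attention."""
    d_k = len(K[0])
    # One row of weights per query: softmax(q . k / sqrt(d_k)) over all keys.
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        for q in Q
    ]
    # Each output row is the weighted sum of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return weights, output


# Toy 3-token example with d_k = 2; the numbers are illustrative only.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, output = attention(Q, K, V)
```

Every row of `weights` sums to 1 because of the softmax, which is why each heatmap row can be read as a distribution over key tokens.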
Understanding the Heatmap
Color Interpretation
| Color | Meaning |
|---|---|
| Dark | Low attention |
| Purple | Medium attention |
| Yellow | Strong attention |
Example:
watched -> dog
Represents a strong verb-object relationship.
Example:
the -> cat
Represents article-noun binding.
Role of Special Tokens
[CLS]
Represents a summary of the entire sentence.
Used for:
- classification
- semantic similarity
- retrieval embeddings
- sentiment detection
If many tokens attend to [CLS], this may indicate that the model is building a global sentence representation.
[SEP]
Marks a sentence boundary.
Often used for:
- segmentation
- sentence compression
- sequence framing
Late transformer layers frequently route information into [SEP].
Layer-wise Attention Behavior
Transformer layers progressively refine meaning. The ranges below assume a 12-layer model such as bert-base-uncased and are approximate.
| Layer Range | Model Behavior |
|---|---|
| Layer 1-2 | Token identity stabilization |
| Layer 3-6 | Grammar learning |
| Layer 7-10 | Phrase relationships |
| Layer 11-12 | Sentence-level semantics |
Early Layer Example (Layer 1)
Observed pattern:
cat -> cat
sat -> sat
mat -> mat
Meaning:
Tokens attend mostly to themselves.
Interpretation:
Early layers confirm token identity before contextual reasoning begins.
Example screenshot:
Insert Layer 1 Heatmap Screenshot Here
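One way to put a number on "tokens attend mostly to themselves" is the mean of the heatmap diagonal. The sketch below assumes a square attention matrix with rows as queries. The metric and the name `diagonal_mass` are illustrative choices, not something the tool reports.

```python
def diagonal_mass(matrix):
    """Average weight each token places on itself (mean of the diagonal)."""
    n = len(matrix)
    return sum(matrix[i][i] for i in range(n)) / n


# Made-up matrices: one identity-like head and one perfectly uniform head.
identity_like = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
uniform = [[1 / 3] * 3 for _ in range(3)]

self_focus = diagonal_mass(identity_like)
baseline = diagonal_mass(uniform)
```

A value well above the uniform baseline of 1/n suggests the self-attending pattern described for early layers.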
Boundary Detection Heads
Observed pattern:
tokens -> [CLS]
tokens -> [SEP]
Interpretation:
Model identifies sentence start and end anchors.
These heads help construct positional awareness.
Example screenshot:
Insert Layer 1 Head 2 Screenshot Here
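The boundary pattern can be checked by summing, for each query token, the attention sent to the [CLS] and [SEP] columns. The matrix below is made up, and `boundary_share` is a hypothetical helper.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def boundary_share(tokens, matrix):
    """Per query token, the fraction of attention sent to [CLS] or [SEP]."""
    special_cols = [i for i, tok in enumerate(tokens) if tok in SPECIAL_TOKENS]
    return [sum(row[j] for j in special_cols) for row in matrix]


tokens = ["[CLS]", "cat", "sat", "[SEP]"]
# Made-up weights for a boundary-style head; each row sums to 1.
matrix = [
    [0.40, 0.10, 0.10, 0.40],
    [0.45, 0.05, 0.05, 0.45],
    [0.30, 0.10, 0.20, 0.40],
    [0.50, 0.00, 0.00, 0.50],
]

shares = boundary_share(tokens, matrix)
```

Shares close to 1 for most rows would mark a head as boundary-focused under this simple heuristic.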
Middle Layer Example (Layer 5)
Observed pattern:
on -> sat
the -> mat
watched -> dog
Interpretation:
Model captures grammatical relationships:
- preposition to verb
- article to noun
- verb to object
These are syntactic reasoning heads.
Example screenshot:
Insert Layer 5 Screenshot Here
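Links such as `on -> sat` can be extracted from a matrix by taking, for each ordinary token, the strongest key other than itself and the special tokens. This is a sketch on made-up weights. `strongest_links` is a hypothetical helper, and ignoring [CLS] and [SEP] is an assumption made to surface word-to-word links.

```python
def strongest_links(tokens, matrix, skip=("[CLS]", "[SEP]")):
    """For each ordinary token, find the other ordinary token it attends to most."""
    links = []
    for i, query in enumerate(tokens):
        if query in skip:
            continue
        candidates = [
            (matrix[i][j], j)
            for j, key in enumerate(tokens)
            if j != i and key not in skip
        ]
        if candidates:
            weight, j = max(candidates)
            links.append((query, tokens[j], weight))
    return links


tokens = ["[CLS]", "sat", "on", "the", "mat", "[SEP]"]
# Made-up weights for illustration only; each row sums to 1.
matrix = [
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],
    [0.10, 0.20, 0.10, 0.10, 0.40, 0.10],
    [0.10, 0.50, 0.10, 0.10, 0.10, 0.10],
    [0.10, 0.05, 0.05, 0.10, 0.60, 0.10],
    [0.10, 0.30, 0.20, 0.10, 0.20, 0.10],
    [0.20, 0.10, 0.10, 0.10, 0.10, 0.40],
]

links = strongest_links(tokens, matrix)
```

On this toy matrix the result includes `("on", "sat", 0.5)` and `("the", "mat", 0.6)`, the preposition-to-verb and article-to-noun patterns listed above.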
Late Layer Example (Layer 11)
Observed pattern:
all tokens -> [SEP]
Interpretation:
Model compresses sentence meaning into a global representation token.
This stage prepares embeddings for:
- classification
- semantic similarity
- retrieval pipelines
Example screenshot:
Insert Layer 11 Screenshot Here
Multi-Head Attention Behavior
Each transformer layer contains multiple heads.
Each head can specialize in a different linguistic feature.
Typical head specializations:
| Head Type | Role |
|---|---|
| Positional | token order |
| Syntactic | grammar links |
| Semantic | meaning similarity |
| Boundary | CLS / SEP anchors |
| Long-range | clause connections |
Switching heads reveals different reasoning strategies.
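Because heads differ, averaging them gives a single summary view of a layer. The sketch below assumes that the "average all heads" mode is an element-wise arithmetic mean over the per-head matrices of one layer. The source does not spell this out, and `average_heads` is a hypothetical name.

```python
def average_heads(heads):
    """Element-wise mean of per-head attention matrices for one layer."""
    n_heads = len(heads)
    size = len(heads[0])
    return [
        [sum(head[i][j] for head in heads) / n_heads for j in range(size)]
        for i in range(size)
    ]


# Two made-up heads with opposite patterns: self-attending vs. cross-attending.
head_a = [[1.0, 0.0], [0.0, 1.0]]
head_b = [[0.0, 1.0], [1.0, 0.0]]
averaged = average_heads([head_a, head_b])
```

The average of these two heads is uniform, which illustrates how averaging can hide the specialization of individual heads even though each averaged row still sums to 1.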
Example Attention Insights From This Tool
Sentence:
The cat sat on the mat and watched the dog
Model internally builds:
Layer 1:
token identity
Layer 2:
article -> noun
Layer 5:
subject -> verb
Layer 8:
clause linking via "and"
Layer 11:
sentence representation compression
This reflects how transformer reasoning evolves step-by-step.
Why This Tool Is Useful
This visualizer helps researchers and engineers:
- inspect model reasoning
- debug hallucinations
- analyze token influence
- study linguistic structure learning
- understand embedding formation
Similar tools are used in transformer interpretability research.
Tech Stack
Backend:
- FastAPI
- PyTorch
- HuggingFace Transformers
Frontend:
- React
- Plotly
Visualization:
- attention matrices
- token relationships
- head-level reasoning
Future Improvements
Possible extensions:
- automatic head role labeling
- syntax vs semantic head detection
- cross-layer attention animation
- GPU acceleration support
- sentence embedding export
Summary
This project demonstrates how transformers progressively construct meaning from text.
From token identity to grammar to semantic understanding, attention heatmaps provide a transparent window into model reasoning.
This makes the system valuable for:
- AI engineers
- NLP researchers
- students learning transformers
- interpretability researchers




