---
title: Attention Visualizer
emoji: 🧠
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

# Transformer Attention Visualizer

An interactive tool for exploring how transformer-based language models (such as BERT, DistilBERT, and GPT-2) process sentences internally, using **self-attention heatmaps**.

![Tech Stack](https://img.shields.io/badge/stack-FastAPI%20%7C%20HuggingFace%20%7C%20React%20%7C%20Plotly-6366f1?style=flat-square)


## Features

- **Multi-model support** — BERT Base, DistilBERT, GPT-2
- **Per-layer, per-head** attention heatmaps
- **Average all heads** mode
- **Click-to-pin tokens** — see what each token attends to
- **Dark glassmorphism UI** with smooth animations
- LRU model cache — loads once, reuses across requests
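
The model cache listed above could be sketched with the standard library's `functools.lru_cache`. This is an illustration only: `get_model`, `_expensive_load`, and the `maxsize=3` capacity are assumptions, not names taken from the backend, and the loader is a stand-in that records calls instead of loading real weights.

```python
from functools import lru_cache

# Records every real load so the caching effect is visible.
LOAD_CALLS = []


def _expensive_load(model_id: str) -> dict:
    """Hypothetical stand-in for the real loader (e.g. a from_pretrained call)."""
    LOAD_CALLS.append(model_id)
    return {"model_id": model_id}


@lru_cache(maxsize=3)  # assumed capacity: one slot per supported model
def get_model(model_id: str) -> dict:
    """Load a model on first request and reuse the same object afterwards."""
    return _expensive_load(model_id)


first = get_model("bert-base-uncased")
second = get_model("bert-base-uncased")  # served from the cache, no reload
print(first is second, LOAD_CALLS)
```

With this shape, the least recently used model is evicted once more than `maxsize` distinct models have been requested.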

## Quick Start

```bash
# One-shot launcher (installs deps + starts both servers)
chmod +x start.sh && ./start.sh
```

Then open **http://localhost:5173**

API docs are available at **http://localhost:8000/docs**

## Manual Setup

### Backend

```bash
cd backend
pip install -r requirements.txt
uvicorn main:app --reload --port 8000
```

### Frontend

```bash
cd frontend
npm install
npm run dev
```

## Architecture

```
frontend (React + Plotly)  →  /api/attend (FastAPI)  →  HuggingFace + PyTorch
     port 5173                   port 8000
```

## Models

| Model | Layers | Heads | Type | Size |
|-------|--------|-------|------|------|
| bert-base-uncased | 12 | 12 | Encoder | 440MB |
| distilbert-base-uncased | 6 | 12 | Encoder | 265MB |
| gpt2 | 12 | 12 | Decoder | 548MB |

Models are downloaded automatically from HuggingFace on first use and cached locally.
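
For readers who want to see where the attention weights come from, here is a minimal sketch of pulling them out of a Hugging Face model. The function name `extract_attentions` and the nested-list return format are assumptions made for illustration; the backend may structure this differently. The heavy imports sit inside the function so the snippet can be loaded without `torch` or `transformers` installed.

```python
def extract_attentions(text: str, model_id: str = "bert-base-uncased"):
    """Return (tokens, attentions) for one sentence.

    attentions is nested as [layer][head][query][key].
    """
    # Imported lazily: these packages are only needed when the function runs.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # The first call downloads the weights; later calls use the local cache.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id, output_attentions=True)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # outputs.attentions is a tuple with one tensor per layer, each shaped
    # (batch, heads, seq_len, seq_len); keep batch item 0 as plain lists.
    attentions = [layer[0].tolist() for layer in outputs.attentions]
    return tokens, attentions
```

Calling `extract_attentions("The cat sat on the mat.")` should yield one entry per layer and, within each layer, one square matrix per head.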

## API

```
GET  /api/models   → list of available models
POST /api/attend   → { text, model_id } → { tokens, attentions, n_layers, n_heads }
GET  /api/health   → { status: "ok" }
```
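
A small client for the `/api/attend` endpoint might look like the sketch below. It uses only the standard library. The request and response field names come from the listing above, but the nesting of `attentions` as `[layer][head][query][key]` is an assumption, and the helper names are hypothetical.

```python
import json
from urllib import request

API_URL = "http://localhost:8000/api/attend"  # backend port from this README


def build_payload(text: str, model_id: str) -> bytes:
    """Encode the POST body with the two documented fields."""
    return json.dumps({"text": text, "model_id": model_id}).encode("utf-8")


def fetch_attention(text: str, model_id: str = "bert-base-uncased") -> dict:
    """POST to /api/attend and return the decoded JSON response."""
    req = request.Request(
        API_URL,
        data=build_payload(text, model_id),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def head_matrix(response: dict, layer: int, head: int) -> list:
    """Pick one head's matrix, ASSUMING attentions[layer][head][query][key]."""
    return response["attentions"][layer][head]
```

With the backend running, `head_matrix(fetch_attention("The cat sat."), 0, 0)` would return the first head of the first layer.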


This project provides a full-stack implementation using:

- FastAPI backend
- Hugging Face Transformers
- PyTorch inference
- React frontend
- Plotly attention visualization

It allows users to inspect attention behavior across **tokens, heads, and layers** to understand how contextual meaning is built inside transformer architectures.

---

# Project Goal

This tool helps users answer one key question:

> How does a transformer model understand language internally?

By visualizing attention matrices in real time, we can observe:

- token relationships
- grammatical structure learning
- semantic reasoning
- sentence-level representation formation

---

# Example Visualization

Example sentence:

```
The cat sat on the mat and watched the dog.
```

Tokenized form:

```
[CLS] the cat sat on the mat and watched the dog . [SEP]
```

Each heatmap cell shows how much one token attends to another token:

- **Rows**: the query token (who is looking)
- **Columns**: the key token (who is being looked at)

Color intensity represents attention strength.
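
Since rows are queries and columns are keys, reading "what does this token attend to" means reading one row. The sketch below ranks the columns of a single row. `top_attended` is a hypothetical helper and the matrix values are made up for illustration.

```python
def top_attended(tokens, matrix, query_token, k=3):
    """Return the k (key_token, weight) pairs with the highest weight in one row."""
    row = matrix[tokens.index(query_token)]  # first occurrence of the token
    ranked = sorted(zip(tokens, row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


tokens = ["[CLS]", "the", "cat", "[SEP]"]
matrix = [  # made-up weights; each row sums to 1
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.20, 0.60, 0.10],
    [0.10, 0.30, 0.50, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
print(top_attended(tokens, matrix, "the", k=2))
# [('cat', 0.6), ('the', 0.2)]
```

Note that `tokens.index` picks the first occurrence, so a sentence with a repeated token would need a position index instead.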

---

# Transformer Attention Mechanism

Self-attention is computed as:

```
Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V
```

Meaning:

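1. Each token generates a query vector
2. Each token generates a key vector and a value vector
3. Each query is compared against every key using a scaled dot product
4. Softmax turns the similarity scores into attention weights
5. Each token's output is the attention-weighted sum of the value vectors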

The heatmap visualizes these normalized attention weights.
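
The formula above can be worked through by hand on toy data. The following dependency-free sketch computes both the attention weights (what the heatmap shows) and the output vectors for a single head. The matrices are invented for illustration, and real models also apply learned projections and run many heads in parallel.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return (weights, output) for softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Scaled dot product of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is the weighted sum of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return weights, output


Q = K = [[1.0, 0.0], [0.0, 1.0]]  # toy queries and keys
V = [[1.0, 2.0], [3.0, 4.0]]      # toy values
weights, output = attention(Q, K, V)
print(weights)
```

Every row of `weights` sums to 1, which is why each heatmap row can be read as a distribution over the tokens being attended to.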

---

# Understanding the Heatmap

## Color Interpretation

| Color | Meaning |
|-------|---------|
| Dark | Low attention |
| Purple | Medium attention |
| Yellow | Strong attention |

Example:

```
watched -> dog
```

Represents a strong verb-object relationship.

Example:

```
the -> cat
```

Represents article-noun binding.

---

# Role of Special Tokens

## [CLS]

Represents a summary of the entire sentence.

Used for:

- classification
- semantic similarity
- retrieval embeddings
- sentiment detection

If the `[CLS]` row attends broadly across the sentence, the model is gathering token information into a global sentence representation.

## [SEP]

Marks a sentence boundary.

Often used for:

- segmentation
- sentence compression
- sequence framing

In late transformer layers, many tokens frequently place most of their attention on `[SEP]`.

---

# Layer-wise Attention Behavior

Transformer layers progressively refine meaning.

| Layer Range | Model Behavior |
|-------------|----------------|
| Layer 1-2 | Token identity stabilization |
| Layer 3-6 | Grammar learning |
| Layer 7-10 | Phrase relationships |
| Layer 11-12 | Sentence-level semantics |

---

# Early Layer Example (Layer 1)

Observed pattern:

```
cat -> cat
sat -> sat
mat -> mat
```

Meaning:

Tokens attend mostly to themselves.

Interpretation:

Early layers confirm token identity before contextual reasoning begins.

Example screenshot:

```
Insert Layer 1 Heatmap Screenshot Here
```
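
One rough way to quantify the "tokens attend mostly to themselves" pattern is to average the diagonal of a head's matrix. `diagonal_mass` is a hypothetical helper, not part of the tool, and the matrix below is made up.

```python
def diagonal_mass(matrix):
    """Average weight each token places on itself (mean of the diagonal)."""
    n = len(matrix)
    return sum(matrix[i][i] for i in range(n)) / n


early_like = [  # made-up head where self-attention dominates
    [0.8, 0.2],
    [0.3, 0.7],
]
print(diagonal_mass(early_like))
```

A value near 1 means the head is almost purely self-attending; a value near `1 / n` means it pays no special attention to the diagonal.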

---

# Boundary Detection Heads

Observed pattern:

```
tokens -> [CLS]
tokens -> [SEP]
```

Interpretation:

The model identifies sentence start and end anchors.

These heads help construct positional awareness.

Example screenshot:

```
Insert Layer 1 Head 2 Screenshot Here
```
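
To check whether a head behaves like a boundary head, one option is to measure how much of each row's attention lands on the `[CLS]` and `[SEP]` columns. The helper name `mass_on` and the sample values are assumptions for illustration.

```python
def mass_on(tokens, matrix, targets=("[CLS]", "[SEP]")):
    """Mean share of each row's attention that lands on the target columns."""
    cols = [i for i, tok in enumerate(tokens) if tok in targets]
    per_row = [sum(row[c] for c in cols) for row in matrix]
    return sum(per_row) / len(per_row)


tokens = ["[CLS]", "cat", "[SEP]"]
boundary_like = [  # made-up head that mostly looks at the special tokens
    [0.5, 0.2, 0.3],
    [0.4, 0.2, 0.4],
    [0.1, 0.1, 0.8],
]
print(mass_on(tokens, boundary_like))
```

Passing `targets=("[SEP]",)` gives the same measurement for `[SEP]` alone, which is convenient when comparing early and late layers.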

---

# Middle Layer Example (Layer 5)

Observed pattern:

```
on -> sat
the -> mat
watched -> dog
```

Interpretation:

The model captures grammatical relationships:

- preposition to verb
- article to noun
- verb to object

These are syntactic reasoning heads.

Example screenshot:

```
Insert Layer 5 Screenshot Here
```

---

# Late Layer Example (Layer 11)

Observed pattern:

```
all tokens -> [SEP]
```

Interpretation:

The model appears to compress sentence meaning into a global representation token.

This stage prepares embeddings for:

- classification
- semantic similarity
- retrieval pipelines

Example screenshot:

```
Insert Layer 11 Screenshot Here
```

---

# Multi-Head Attention Behavior

Each transformer layer contains multiple heads.

Different heads tend to specialize in different linguistic features.

Typical head specializations:

| Head Type | Role |
|-----------|------|
| Positional | token order |
| Syntactic | grammar links |
| Semantic | meaning similarity |
| Boundary | CLS / SEP anchors |
| Long-range | clause connections |

Switching heads reveals different reasoning strategies.
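
When individual heads are too noisy to read, averaging them gives a single matrix per layer. The sketch below shows one plausible way to compute such an average; it assumes a layer is stored as `[head][query][key]` and is not necessarily how the tool's "average all heads" mode is implemented.

```python
def average_heads(layer):
    """Element-wise mean of all head matrices in one layer ([head][query][key])."""
    n_heads = len(layer)
    size = len(layer[0])
    return [
        [sum(head[q][k] for head in layer) / n_heads for k in range(size)]
        for q in range(size)
    ]


layer = [  # two made-up heads with opposite behavior
    [[1.0, 0.0], [0.0, 1.0]],  # head 0: pure self-attention
    [[0.0, 1.0], [1.0, 0.0]],  # head 1: attends to the other token
]
print(average_heads(layer))
# [[0.5, 0.5], [0.5, 0.5]]
```

Because every head's rows sum to 1, the averaged rows also sum to 1. As the toy example shows, though, averaging can hide sharply different per-head patterns.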

---

# Example Attention Insights From This Tool

Sentence:

```
The cat sat on the mat and watched the dog
```

Model internally builds:

Layer 1:

```
token identity
```
![Layer 1 Head 1 Attention](docs/images/layer01_Head_01.png)

Layer 2:

```
article -> noun
```
![Layer 2 Head 2 Attention](docs/images/layer02_Head_02.png)

Layer 5:

```
subject -> verb
```
![Layer 5 Head 6 Attention](docs/images/layer05_Head_06.png)

Layer 8:

```
clause linking via "and"
```
![Layer 8 Head 8 Attention](docs/images/layer05_Head11.png)

Layer 11:

```
sentence representation compression
```

![Layer 11 Head 11 Attention](docs/images/layer11_Head_01.png)

This reflects how transformer reasoning evolves step-by-step.

---

# Why This Tool Is Useful

This visualizer helps researchers and engineers:

- inspect model reasoning
- debug hallucinations
- analyze token influence
- study linguistic structure learning
- understand embedding formation

Similar tools are used in transformer interpretability research.

---

# Tech Stack

Backend:

- FastAPI
- PyTorch
- HuggingFace Transformers

Frontend:

- React
- Plotly

Visualization:

- attention matrices
- token relationships
- head-level reasoning

---

# Future Improvements

Possible extensions:

- automatic head role labeling
- syntax vs semantic head detection
- cross-layer attention animation
- GPU acceleration support
- sentence embedding export

---

# Summary

This project demonstrates how transformers progressively construct meaning from text.

From token identity to grammar to semantic understanding, attention heatmaps provide a transparent window into model reasoning.

This makes the system valuable for:

- AI engineers
- NLP researchers
- students learning transformers
- interpretability research