---
license: mit
pipeline_tag: text-generation
tags: [research, experimental, gravity-attention, qwen2]
---

# Gravity-2

![Gravity-2](https://cdn-uploads.huggingface.co/production/uploads/67329d3f69fded92d56ab41a/NkBrWQXsuwZUrsTaYga5-.jpeg)

**Experimental research model by squ11z1.** 

A 3B reasoning model in which the standard
scaled-dot-product attention is replaced by a physically motivated **gravity attention**,
then adapted with LoRA. This card documents a **stage-1 proof-of-mechanism**.

## The experiment

Transformer attention scores tokens by **alignment** — the dot product `q·k`. Gravity-2
asks a different question: *what if tokens attended by **proximity** instead?* We replace
the score with an inverse-square law borrowed from gravitation — each token is pulled
toward others that are close in query/key space, weighted by a learnable per-head "mass":

```
                         M_h²
score(i, j)  =  ─────────────────────          →   softmax_j( score )
                  ‖q_i − k_j‖²  +  ε
```

- **M_h = softplus(gravity_mass_log[h])** — one learnable mass per **query head** (16 / layer),
  initialised at 0.5; `softplus` keeps it strictly positive.
- **‖q_i − k_j‖²** — squared L2 distance, computed stably as `‖q‖² + ‖k‖² − 2·q·k`.
- **ε = 0.1** — softening length; prevents the `q → k` singularity.
- The raw gravity scores are then passed through the **usual softmax** (see Limitations).
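
As a minimal sketch of the scoring rule above, the following pure-Python example computes gravity attention weights for one query and three keys in a single head. It is a scalar illustration, not the model's tensor code: the function names, the toy vectors, the clamp at zero (a guard against slightly negative distances caused by rounding in the expanded form), and the choice of `mass_log` that yields `M_h = 0.5` are all assumptions made for the example. Causal masking is omitted.

```python
import math


def softplus(x: float) -> float:
    """softplus(x) = log(1 + exp(x)); strictly positive for any finite x."""
    return math.log1p(math.exp(x))


def sq_dist(q, k):
    """Squared L2 distance via ||q||^2 + ||k||^2 - 2 q.k.

    The clamp at zero is an assumption of this sketch, not something the
    card states: rounding can push the expanded form slightly below zero.
    """
    qq = sum(a * a for a in q)
    kk = sum(b * b for b in k)
    qk = sum(a * b for a, b in zip(q, k))
    return max(qq + kk - 2.0 * qk, 0.0)


def gravity_weights(q, keys, mass_log, eps=0.1):
    """Softmax over M^2 / (d^2 + eps) for one query and one head."""
    mass = softplus(mass_log)
    scores = [mass * mass / (sq_dist(q, k) + eps) for k in keys]
    top = max(scores)  # subtract the max before exp, as in any softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical setup: pick mass_log so that softplus(mass_log) == 0.5.
MASS_LOG_FOR_HALF = math.log(math.exp(0.5) - 1.0)

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
weights = gravity_weights(q, keys, MASS_LOG_FOR_HALF)
```

The weights sum to 1, and the key that coincides with the query receives the largest share.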

### Why it's interesting
- **Different inductive bias.** Dot-product attention rewards directional alignment;
  inverse-distance rewards *locality* in the learned embedding geometry — a metric prior
  rather than an inner-product one.
- **Interpretable per-head masses.** Each head learns a scalar "mass" controlling how
  sharply it concentrates — a compact, inspectable knob (see `figures/04_mass_heatmap.png`).
- **A bridge to physics-style sparsity.** An inverse-square field is naturally local, which
  later stages (pruning / QUBO, "Gravity-6") aim to exploit for structured sparsity.
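
The contrast between the two inductive biases can be illustrated with a small, hedged example. It compares an unscaled dot product with the gravity score for two hypothetical keys, one near the query and one aligned with it but far away. It also computes the score ceiling: since `d² ≥ 0`, every score lies in `(0, M²/ε]`, so the mass caps how far apart two pre-softmax scores can be. The vectors and the value `M_h = 0.5` are assumptions chosen for illustration.

```python
import math


def dot(q, k):
    """Unscaled dot product q.k (alignment score)."""
    return sum(a * b for a, b in zip(q, k))


def gravity_score(q, k, mass=0.5, eps=0.1):
    """Pre-softmax gravity score M^2 / (||q - k||^2 + eps)."""
    d2 = sum((a - b) ** 2 for a, b in zip(q, k))
    return mass * mass / (d2 + eps)


q = [1.0, 0.0]
candidates = {
    "near": [0.9, 0.1],         # close to q in the embedding space
    "aligned_far": [4.0, 0.0],  # same direction as q, but far away
}

dot_scores = {name: dot(q, k) for name, k in candidates.items()}
grav_scores = {name: gravity_score(q, k) for name, k in candidates.items()}

dot_winner = max(dot_scores, key=dot_scores.get)
grav_winner = max(grav_scores, key=grav_scores.get)

# Score ceiling for an assumed mass of 0.5 and eps = 0.1.
max_score = 0.5 ** 2 / 0.1
# Upper bound on the softmax weight ratio between any two unmasked keys.
max_ratio = math.exp(max_score)
```

Under these assumptions the dot product prefers the aligned but distant key, while the gravity score prefers the nearby one. The ceiling `M²/ε` grows with the mass, which is one way to read the claim that the mass controls how sharply a head can concentrate.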

## Architecture
Qwen2-3B class: 36 layers, hidden 2048, **16 query heads / 2 KV heads (GQA, group size 8)**,
head_dim 128. The 2 KV heads are `repeat_kv`-expanded to 16 before the distance is computed,
so each query head gets its own mass. The mechanism is integrated via the transformers-5.x
`AttentionInterface` (a registered `"gravity"` op plus reuse of the eager causal mask). RoPE,
the KV cache, and masking are left to the framework; only the score function changes.
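
The head bookkeeping implied by these numbers can be sketched as follows. The example assumes the contiguous-block expansion performed by the `repeat_kv` helper in transformers, where each KV head is repeated consecutively so that query head `h` reads KV head `h // group_size`. The constant and function names are hypothetical, and the per-KV-group count is shown only as the alternative granularity, not as something this model uses.

```python
# Shape constants taken from the architecture description.
NUM_LAYERS = 36
HIDDEN = 2048
NUM_Q_HEADS = 16
NUM_KV_HEADS = 2
HEAD_DIM = 128

# Each KV head serves this many query heads.
group_size = NUM_Q_HEADS // NUM_KV_HEADS


def kv_head_for(query_head: int) -> int:
    """KV head read by a query head, assuming contiguous-block expansion."""
    return query_head // group_size


kv_of = [kv_head_for(h) for h in range(NUM_Q_HEADS)]

# One mass per query head per layer (the granularity described here).
masses_per_query_head = NUM_LAYERS * NUM_Q_HEADS
# Alternative granularity: one mass per KV group per layer.
masses_per_kv_group = NUM_LAYERS * NUM_KV_HEADS
```

With these values, query heads 0 to 7 share KV head 0 and heads 8 to 15 share KV head 1, yet each of the 16 heads keeps its own mass, giving 576 learnable masses across the model instead of the 72 a per-group scheme would have.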

## Results

| | |
|---|---|
| ![loss](figures/01_loss.png) | ![masses](figures/02_mass_band.png) |
| ![grad](figures/03_gradnorm.png) | ![heatmap](figures/04_mass_heatmap.png) |
| ![aer](figures/05_aer_entropy.png) | ![concept](figures/06_concept.png) |

## Honest limitations
- **Not "pure" gravity.** The inverse-square scores are renormalised by a **softmax on top**
  (`softmax_j(M²/(d²+ε))`). Without it, training was unstable, but the result is a
  *distance-biased softmax attention*, not a literal gravitational field: the normalisation
  reintroduces global competition between keys.
- **MHA → GQA transfer is an open question.** The mechanism was first prototyped on MHA
  (1 KV head per query head). Here it runs on GQA by `repeat_kv`-expanding 2 KV heads to 16
  and giving each query head its own mass; whether this is the right granularity (vs. one
  mass per KV group) is **unresolved** and may matter for convergence.
- **Loading requires the patch** (below). **GGUF builds run standard attention, not gravity**
  (llama.cpp has no kernel for `M²/(‖q−k‖²+ε)`) — the `*.gguf` files are format placeholders
  and produce incorrect output.

## Loading (requires the gravity patch)
```bash
python load_gravity2.py   # from_pretrained -> patch_qwen_with_gravity -> load gravity_mass_log.pt
```
The LoRA weights are merged into the base model, but they were trained under gravity scoring,
so loading them under vanilla attention gives garbage. `config.json` ships
`_attn_implementation="eager"` only so that the checkpoint loads; the patch then switches it to gravity.

## License & attribution
Released under the **MIT License**. This is a **derivative work of
[`WeiboAI/VibeThinker-3B`](https://huggingface.co/WeiboAI/VibeThinker-3B)** (the base model
for the experiment), which is distributed under the **MIT License**; that license is
inherited here and the original authors are credited accordingly.