---
license: mit
pipeline_tag: text-generation
tags: [research, experimental, gravity-attention, qwen2]
---
# Gravity-2
![IMAGE 2026-06-16 19:46:27](https://cdn-uploads.huggingface.co/production/uploads/67329d3f69fded92d56ab41a/NkBrWQXsuwZUrsTaYga5-.jpeg)
**Experimental research model by squ11z1.**
A 3B reasoning model in which the standard
scaled-dot-product attention is replaced by a physically-motivated **gravity attention**,
then adapted with LoRA. This card documents a **stage-1 proof-of-mechanism**.
## The experiment
Transformer attention scores tokens by **alignment**: the dot product `q·k`. Gravity-2
asks a different question: *what if tokens attended by **proximity** instead?* We replace
the score with an inverse-square law borrowed from gravitation, so that each token is pulled
toward others that are close in query/key space, weighted by a learnable per-head "mass":
```
                      M_h²
score(i, j) = ──────────────────  →  softmax_j( score )
               ‖q_i − k_j‖² + ε
```
- **M_h = softplus(gravity_mass_log[h])**: one learnable mass per **query head** (16 / layer),
initialised at 0.5; `softplus` keeps it strictly positive.
- **‖q_i − k_j‖²**: squared L2 distance, computed stably as `‖q‖² + ‖k‖² − 2·q·k`.
- **ε = 0.1**: softening length; prevents the `q → k` singularity.
- The raw gravity scores are then passed through the **usual softmax** (see Limitations).
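
The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's implementation: the function names, the clamp of tiny negative distances to zero, and the explicit causal mask are assumptions made for the sketch (in the model itself, masking is left to the framework).

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the mass strictly positive.
    return np.logaddexp(0.0, x)

def gravity_attention_weights(q, k, mass_log, eps=0.1):
    """q, k: (heads, seq, head_dim); mass_log: (heads,). Returns (heads, seq, seq)."""
    mass = softplus(mass_log)                          # M_h, one per head
    q_sq = np.sum(q * q, axis=-1)[:, :, None]          # ||q_i||^2 -> (h, s, 1)
    k_sq = np.sum(k * k, axis=-1)[:, None, :]          # ||k_j||^2 -> (h, 1, s)
    qk = q @ np.swapaxes(k, -1, -2)                    # q_i . k_j -> (h, s, s)
    # Expanded form of ||q_i - k_j||^2. Assumption: clamp at 0 to absorb
    # small negative values caused by floating-point rounding.
    dist_sq = np.maximum(q_sq + k_sq - 2.0 * qk, 0.0)
    score = (mass ** 2)[:, None, None] / (dist_sq + eps)
    # Assumption: a plain lower-triangular causal mask for this sketch.
    seq = q.shape[1]
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    score = np.where(causal, score, -np.inf)
    # Usual softmax over keys j, shifted by the row maximum for stability.
    score = score - score.max(axis=-1, keepdims=True)
    weights = np.exp(score)
    return weights / weights.sum(axis=-1, keepdims=True)
```

Each row of the returned array sums to 1, and for a given query the key nearest in L2 distance receives the largest weight among the unmasked positions.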
### Why it's interesting
- **Different inductive bias.** Dot-product attention rewards directional alignment;
inverse-distance rewards *locality* in the learned embedding geometry, a metric prior
rather than an inner-product one.
- **Interpretable per-head masses.** Each head learns a scalar "mass" controlling how
sharply it concentrates: a compact, inspectable knob (see `figures/04_mass_heatmap.png`).
- **A bridge to physics-style sparsity.** An inverse-square field is naturally local, which
later stages (pruning / QUBO, "Gravity-6") aim to exploit for structured sparsity.
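
To make the "mass controls sharpness" point concrete, note that the score `M²/(d² + ε)` is bounded: it peaks at `M²/ε` when the distance is zero and decays toward 0 for distant keys. The softmax weights of any two unmasked keys can therefore differ by a factor of at most `exp(M²/ε)`. The sketch below works through that arithmetic; the helper names are hypothetical, and it assumes "initialised at 0.5" refers to `M_h` itself rather than to the raw `gravity_mass_log` parameter.

```python
import math

def max_logit(mass, eps=0.1):
    """Largest possible score M^2 / (d^2 + eps), reached at distance d = 0."""
    return mass ** 2 / eps

def max_weight_ratio(mass, eps=0.1):
    """Upper bound on w_a / w_b for any two unmasked keys a, b of one query.

    Scores lie in (0, M^2/eps], so two logits differ by less than M^2/eps
    and the softmax weights differ by a factor below exp(M^2/eps).
    """
    return math.exp(max_logit(mass, eps))

# Hypothetical mass values chosen for illustration only.
for m in (0.5, 1.0, 2.0):
    print(f"M={m}: max logit {max_logit(m):.1f}, "
          f"weight ratio bound {max_weight_ratio(m):.3g}")
```

Under these assumptions a head at `M = 0.5` has a peak logit of 2.5, so its nearest key can outweigh its farthest by a factor of roughly 12 at most. A larger mass raises that ceiling exponentially, which is the sense in which the mass acts as a sharpness knob.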
## Architecture
Qwen2-3B class: 36 layers, hidden 2048, **16 query heads / 2 KV heads (GQA, group size 8)**,
head_dim 128. The 2 KV heads are `repeat_kv`-expanded to 16 before the distance computation, so each
query head gets its own mass. Integrated via the transformers-5.x `AttentionInterface`
(a registered `"gravity"` op + eager causal-mask reuse); RoPE / KV-cache / masking are
left to the framework, and only the score function changes.
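
The GQA expansion described above can be illustrated with a NumPy stand-in for `repeat_kv`. This is a sketch under stated assumptions: the real helper operates on PyTorch tensors inside transformers, and the batch size, sequence length, and mass values here are made up for the example.

```python
import numpy as np

def repeat_kv(x, n_rep):
    """Expand (batch, kv_heads, seq, dim) to (batch, kv_heads * n_rep, seq, dim).

    Each KV head is repeated n_rep times consecutively, so query heads
    0..n_rep-1 read KV head 0, the next n_rep read KV head 1, and so on.
    """
    return np.repeat(x, n_rep, axis=1)

# Shapes taken from the card: 16 query heads, 2 KV heads, head_dim 128.
num_q_heads, num_kv_heads, head_dim = 16, 2, 128
n_rep = num_q_heads // num_kv_heads           # group size 8
batch, seq = 1, 5                             # illustrative values

rng = np.random.default_rng(0)
k = rng.normal(size=(batch, num_kv_heads, seq, head_dim))
k_expanded = repeat_kv(k, n_rep)              # shape (1, 16, 5, 128)

# One mass per query head: heads in the same group share keys but can
# still score them differently because their masses differ.
mass = np.linspace(0.25, 1.0, num_q_heads)    # hypothetical, not learned values
kv_head_of = np.arange(num_q_heads) // n_rep  # query head -> KV head index
```

Query heads 0 to 7 see identical keys after expansion, so any difference in their gravity scores comes from their queries and their individual masses.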
## Results
| | |
|---|---|
| ![loss](figures/01_loss.png) | ![masses](figures/02_mass_band.png) |
| ![grad](figures/03_gradnorm.png) | ![heatmap](figures/04_mass_heatmap.png) |
| ![aer](figures/05_aer_entropy.png) | ![concept](figures/06_concept.png) |
## Honest limitations
- **Not "pure" gravity.** The inverse-square scores are renormalised by a **softmax on top**
(`softmax_j(M²/(d²+ε))`). Without it training was unstable, but it means this is a
*distance-biased softmax attention*, not a literal gravitational field: the normalisation
reintroduces global competition between keys.
- **MHA → GQA transfer is an open question.** The mechanism was first prototyped on MHA
(1 KV head per query head). Here it runs on GQA by `repeat_kv`-expanding 2 KV heads to 16
and giving each query head its own mass; whether this is the right granularity (vs. one
mass per KV group) is **unresolved** and may matter for convergence.
- **Loading requires the patch** (below). **GGUF builds run standard attention, not gravity**
(llama.cpp has no kernel for `M²/(‖q−k‖²+ε)`), so the `*.gguf` files are format placeholders
and produce incorrect output.
## Loading (requires the gravity patch)
```bash
python load_gravity2.py # from_pretrained -> patch_qwen_with_gravity -> load gravity_mass_log.pt
```
Weights are LoRA-merged into the base but were trained under gravity scoring; loading them
under vanilla attention gives garbage. `config.json` ships `_attn_implementation="eager"`
only so the checkpoint loads; the patch switches it to gravity.
## License & attribution
Released under the **MIT License**. This is a **derivative work of
[`WeiboAI/VibeThinker-3B`](https://huggingface.co/WeiboAI/VibeThinker-3B)** (the base model
for the experiment), which is distributed under the **MIT License**; that license is
inherited here and the original authors are credited accordingly.