---
license: mit
pipeline_tag: text-generation
tags: [research, experimental, gravity-attention, qwen2]
---

# Gravity-2

![IMAGE 2026-06-16 19:46:27](https://cdn-uploads.huggingface.co/production/uploads/67329d3f69fded92d56ab41a/NkBrWQXsuwZUrsTaYga5-.jpeg)

**Experimental research model by squ11z1.** A 3B reasoning model in which the standard scaled-dot-product attention is replaced by a physically motivated **gravity attention**, then adapted with LoRA. This card documents a **stage-1 proof-of-mechanism**.

## The experiment

Transformer attention scores tokens by **alignment**: the dot product `q·k`. Gravity-2 asks a different question: *what if tokens attended by **proximity** instead?*

We replace the score with an inverse-square law borrowed from gravitation: each token is pulled toward others that are close in query/key space, weighted by a learnable per-head "mass":

```
                     M_h²
score(i, j) = ──────────────────   →   softmax_j( score )
               ‖q_i − k_j‖² + ε
```

- **M_h = softplus(gravity_mass_log[h])**: one learnable mass per **query head** (16 / layer), initialised at 0.5; `softplus` keeps it strictly positive.
- **‖q_i − k_j‖²**: squared L2 distance, computed stably as `‖q‖² + ‖k‖² − 2·q·k`.
- **ε = 0.1**: softening length; prevents the `q → k` singularity.
- The raw gravity scores are then passed through the **usual softmax** (see Limitations).

### Why it's interesting

- **Different inductive bias.** Dot-product attention rewards directional alignment; inverse-distance rewards *locality* in the learned embedding geometry, a metric prior rather than an inner-product one.
- **Interpretable per-head masses.** Each head learns a scalar "mass" controlling how sharply it concentrates: a compact, inspectable knob (see `figures/04_mass_heatmap.png`).
- **A bridge to physics-style sparsity.** An inverse-square field is naturally local, which later stages (pruning / QUBO, "Gravity-6") aim to exploit for structured sparsity.

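
The score defined under "The experiment" can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code shipped with the model: the function name `gravity_attention`, the clamp on the squared distance, and the plain lower-triangular causal mask are assumptions made for the example (the card says masking is handled by the framework).

```python
import numpy as np


def softplus(x):
    # log(1 + exp(x)), evaluated without overflow for large |x|.
    return np.logaddexp(0.0, x)


def gravity_attention(q, k, v, mass_log, eps=0.1):
    """Gravity-scored causal attention for a single sequence (sketch).

    q, k, v:  arrays of shape (heads, seq, head_dim)
    mass_log: array of shape (heads,), the raw per-head parameter
    Returns (output, weights) with shapes
    (heads, seq, head_dim) and (heads, seq, seq).
    """
    mass = softplus(mass_log)                      # (heads,), positive

    # Squared L2 distance via ||q||^2 + ||k||^2 - 2 q.k
    q_sq = np.sum(q * q, axis=-1)[:, :, None]      # (heads, seq, 1)
    k_sq = np.sum(k * k, axis=-1)[:, None, :]      # (heads, 1, seq)
    dist_sq = q_sq + k_sq - 2.0 * (q @ np.swapaxes(k, -1, -2))
    # Assumption: clamp tiny negative values caused by rounding.
    dist_sq = np.maximum(dist_sq, 0.0)

    # Inverse-square score with softening length eps.
    scores = (mass ** 2)[:, None, None] / (dist_sq + eps)

    # Assumption: plain lower-triangular causal mask.
    seq = q.shape[1]
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    scores = np.where(causal, scores, -np.inf)

    # Softmax over keys, shifted by the row maximum for stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Usage with the head count and head_dim quoted in this card.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 5, 128)) for _ in range(3))
out, w = gravity_attention(q, k, v, mass_log=np.full(16, 0.5))
print(out.shape, w.shape)
```

Because the squared distance is non-negative, each raw score lies in `(0, M_h²/ε]`, so the logits fed to the softmax are bounded above by `M_h²/ε` regardless of how far apart `q_i` and `k_j` are.
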
## Architecture

Qwen2-3B class: 36 layers, hidden 2048, **16 query heads / 2 KV heads (GQA, group size 8)**, head_dim 128. The 2 KV heads are `repeat_kv`-expanded to 16 before the distance is computed, so each query head gets its own mass. Integrated via the transformers-5.x `AttentionInterface` (a registered `"gravity"` op + eager causal-mask reuse): RoPE, KV-cache, and masking are left to the framework; only the score function changes.

## Results

| | |
|---|---|
| ![loss](figures/01_loss.png) | ![masses](figures/02_mass_band.png) |
| ![grad](figures/03_gradnorm.png) | ![heatmap](figures/04_mass_heatmap.png) |
| ![aer](figures/05_aer_entropy.png) | ![concept](figures/06_concept.png) |

## Honest limitations

- **Not "pure" gravity.** The inverse-square scores are renormalised by a **softmax on top** (`softmax_j(M²/(d²+ε))`). Without it training was unstable, but it means this is a *distance-biased softmax attention*, not a literal gravitational field: the normalisation reintroduces global competition between keys.
- **MHA → GQA transfer is an open question.** The mechanism was first prototyped on MHA (1 KV head per query head). Here it runs on GQA by `repeat_kv`-expanding 2 KV heads to 16 and giving each query head its own mass; whether this is the right granularity (vs. one mass per KV group) is **unresolved** and may matter for convergence.
- **Loading requires the patch** (below). **GGUF builds run standard attention, not gravity** (llama.cpp has no kernel for `M²/(‖q−k‖²+ε)`); the `*.gguf` files are format placeholders and produce incorrect output.

## Loading (requires the gravity patch)

```bash
python load_gravity2.py   # from_pretrained -> patch_qwen_with_gravity -> load gravity_mass_log.pt
```

Weights are LoRA-merged into the base but were trained under gravity scoring; loading them under vanilla attention gives garbage output. `config.json` ships `_attn_implementation="eager"` only so the checkpoint loads; the patch then switches it to gravity.

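
Once the masses have been loaded, a quick sanity check is to convert the raw `gravity_mass_log` values to masses and confirm they are positive and of the expected size. The sketch below is only an illustration: it assumes the values can be viewed as a `(36, 16)` array (layers × query heads, from the numbers in the Architecture section), which is a guess about the layout of `gravity_mass_log.pt`, and `summarize_masses` and `inverse_softplus` are hypothetical helper names. It works on a stand-in array rather than reading the file.

```python
import numpy as np

# From the Architecture section: 36 layers, 16 query heads per layer.
NUM_LAYERS, NUM_Q_HEADS = 36, 16


def softplus(x):
    # log(1 + exp(x)), evaluated without overflow.
    return np.logaddexp(0.0, x)


def inverse_softplus(m):
    # Raw value x such that softplus(x) == m, valid for m > 0.
    return np.log(np.expm1(m))


def summarize_masses(mass_log):
    """Summarise per-head masses from raw log-parameters.

    mass_log: hypothetical (layers, heads) array of raw values.
    """
    mass_log = np.asarray(mass_log, dtype=np.float64)
    expected = (NUM_LAYERS, NUM_Q_HEADS)
    if mass_log.shape != expected:
        raise ValueError(f"expected {expected}, got {mass_log.shape}")
    mass = softplus(mass_log)
    return {
        "min": float(mass.min()),
        "max": float(mass.max()),
        "per_layer_mean": mass.mean(axis=1),   # shape (layers,)
    }


# Stand-in data: every raw value set to 0.5 (not the trained masses).
stand_in = np.full((NUM_LAYERS, NUM_Q_HEADS), 0.5)
summary = summarize_masses(stand_in)
print(summary["min"], summary["max"])
```

For reference, `softplus(0.5)` is about 0.974, while a mass of exactly 0.5 corresponds to a raw value of about −0.433 (`inverse_softplus(0.5)`); the stand-in above takes no position on which of the two the "initialised at 0.5" in this card refers to.
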
## License & attribution

Released under the **MIT License**. This is a **derivative work of [`WeiboAI/VibeThinker-3B`](https://huggingface.co/WeiboAI/VibeThinker-3B)** (the base model for the experiment), which is distributed under the **MIT License**; that license is inherited here and the original authors are credited accordingly.