---
library_name: transformers
datasets:
- HuggingFaceH4/MATH-500
- akoksal/LongForm
pipeline_tag: text-generation
---
# MoA-Metric-LM-400M (Convergent)
A compact but capable ~400M-parameter causal LM that replaces dot-product attention with metric-native attention and augments sequence geometry with BlackHoleRoPE (a learnable, stable RoPE variant). It is designed to train and run on modest hardware (CPU-first) while staying fully compatible with 🤗 Transformers.
Tags: Convergent · MoA · Conversational · Research
Datasets: yzhuang/Agentic-Long-Context-Understanding-QA, HuggingFaceH4/MATH-500
License: Apache-2.0
⸻
## Why this model?
• Distance scores, not dot products. Heads score with L2, cosine, or diag-Mahalanobis distances. This gives direct control over geometry, often stabilizes training, and can be more sample-efficient.
• BlackHoleRoPE positional encoding.
  • Q/K: pure unit-modulus rotation (unitary, hence numerically stable).
  • V: bounded-energy gating (Penrose-inspired), optionally modulated by a discrepancy signal.
  • Parameters synthesized from a tiny Fourier basis, so the encoding extrapolates to unseen lengths and stays cache-friendly with low memory.
• MoA (Mixture-of-Architectures) block. A token-wise router softly blends four heads per block:
  1. LocalConv (depthwise token-local convolution)
  2. MetricMHAttention (multi-head metric attention)
  3. ChannelMix (MLP)
  4. MetricMQA (multi-query attention with shared K/V)
• Triangle-Inequality (TI) regularizer. Keeps metric heads honest by penalizing triangle-inequality violations over random triples.
• Runs on CPUs. Implemented to behave well in FP32 on AVX2/AVX-512 machines.
⸻
## Model at a glance
| Property | Value |
|---|---|
| Parameters | ~400 M (exact count depends on vocab; see `config.json`) |
| Layers | 12–24 depending on variant (MoA blocks) |
| Hidden size | ≥ 1024 in the 400 M variant (head dim divisible by #heads) |
| Attention | Metric-native (L2 / cosine / diag-Mahalanobis), plus MetricMQA |
| Positional | BlackHoleRoPE per head (`rope_global` for MH-Attn, `rope_mqa` for MQA) |
| Router | Token-wise soft mixture across the four heads (+ optional bias gate) |
| FFN | HyperFFN = SwiGLU MLP + SepConv1d + low-rank path (router-mixed; sketched below) |
| Context | Trained primarily at 512–1024 tokens; config allows up to 2048 |
| Precision | FP32 training (CPU-friendly); FP32/BF16/FP16 inference supported |
| License | Apache-2.0 |
Note on context: training emphasized 512–1024; BlackHoleRoPE is extrapolable, but throughput and quality beyond training lengths depend on your hardware and data.
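The HyperFFN row above can be pictured roughly as follows. This is a minimal sketch under assumed shapes and module names (the repo's actual wiring may differ): a SwiGLU MLP, a depthwise separable Conv1d, and a low-rank linear path, blended per token by a small router.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperFFNSketch(nn.Module):
    """Illustrative HyperFFN-style block: SwiGLU + separable conv + low-rank, router-mixed.
    All names and the exact wiring are assumptions, not the repo's actual module."""
    def __init__(self, d_model, d_ff, rank=64, kernel=3):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff)
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)                                  # SwiGLU path
        self.dw = nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2, groups=d_model)
        self.pw = nn.Conv1d(d_model, d_model, 1)                              # separable conv path
        self.lr_a = nn.Linear(d_model, rank, bias=False)
        self.lr_b = nn.Linear(rank, d_model, bias=False)                      # low-rank path
        self.router = nn.Linear(d_model, 3)

    def forward(self, x):                                                     # x: [B, T, d_model]
        swiglu = self.down(F.silu(self.gate(x)) * self.up(x))
        # note: symmetric padding peeks one token ahead; a causal LM would pad on the left
        conv = self.pw(self.dw(x.transpose(1, 2))).transpose(1, 2)
        lowrank = self.lr_b(self.lr_a(x))
        w = torch.softmax(self.router(x), dim=-1)                             # [B, T, 3]
        return w[..., :1] * swiglu + w[..., 1:2] * conv + w[..., 2:] * lowrank
```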
⸻
## Intended use & limitations
Intended: compact assistants, long-context reading/QA, math-style step reasoning, research on distance-based attention and geometric inductive biases.
Not intended: safety-critical use, heavy factual QA at web scale, or domains requiring guaranteed accuracy. Evaluate carefully before deployment.
⸻
## Datasets
• Agentic-Long-Context-Understanding-QA — long-range reading/retrieval questions to exercise context tracking (~256k tokens used).
• MATH-500 — small curated math prompts for stepwise reasoning (~256k tokens used).
Training used modest token budgets (hundreds of thousands of tokens). Training logs showed healthy loss descent at both 512 and 1024 sequence lengths in CPU runs; exact metrics will vary with tokenizer, preprocessing, and optimizer settings.
⸻
## Installation

```bash
pip install transformers accelerate sentencepiece
```
⸻
## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "reaperdoesntknow/MoA-400M"  # replace with your repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float32, device_map="cpu"
).eval()

prompt = "Read and answer: If 3x + 2 = 17, what is x?\nReasoning:"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_length=256,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tok.eos_token_id,
    )
print(tok.decode(out[0], skip_special_tokens=True))
```

### Pipeline usage

```python
from transformers import pipeline

repo = "reaperdoesntknow/MoA-400M"
pipe = pipeline("text-generation", model=repo, device_map="cpu")
print(
    pipe(
        "Question: Who wrote 'The Selfish Gene'?\nAnswer:",
        max_length=128,
        do_sample=False,
    )[0]["generated_text"]
)
```
⸻
## Architecture details
### Metric attention (MH)
• Scores (sketched below):
  • L2: −‖q − k‖² / √d
  • Cosine: normalized dot product, then scaled
  • diag-Mahalanobis: per-head diagonal scaling of dimensions before the distance
• Stability: logits scaled by a learnable α; optional radius-based pruning mask for efficiency.
• Value path: post-attention Up/Down projector (gated) for expressive value mixing.
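A minimal sketch of the three scoring modes, assuming `[batch, heads, len, d]` tensors; the actual layer may scale, clamp, or mask differently:

```python
import torch
import torch.nn.functional as F

def metric_scores(q, k, kind="l2", diag_scale=None):
    """Illustrative distance-based attention logits (not the repo's exact code).
    q: [B, H, Tq, d], k: [B, H, Tk, d]; diag_scale: [d] learnable positive scale."""
    d = q.size(-1)
    if kind == "cosine":
        # normalized dot product, rescaled (a learnable temperature in the real layer)
        return F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1) * d ** 0.5
    if kind == "mahalanobis":
        # diagonal Mahalanobis: per-dimension scale applied before the L2 distance
        w = diag_scale.clamp_min(1e-6).sqrt()
        q, k = q * w, k * w
    # squared L2 distance via ||q||^2 + ||k||^2 - 2 q.k, negated so closer pairs score higher
    q2 = q.pow(2).sum(-1, keepdim=True)             # [B, H, Tq, 1]
    k2 = k.pow(2).sum(-1).unsqueeze(-2)             # [B, H, 1, Tk]
    dist2 = q2 + k2 - 2 * (q @ k.transpose(-2, -1))
    return -dist2 / d ** 0.5
```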
### Metric MQA (shared K/V)
• K and V come from a single shared projection and are broadcast across heads; queries remain multi-head. Useful for throughput and memory (see the sketch below).
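A sketch of the shared-K/V idea with a hypothetical class and projection names (not the repo's module), reusing the L2 scoring from above and omitting the causal mask:

```python
import torch
import torch.nn as nn

class MetricMQASketch(nn.Module):
    """Hypothetical multi-query layer: per-head queries, one shared K/V projection."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.h, self.dh = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)          # multi-head queries
        self.kv_proj = nn.Linear(d_model, 2 * self.dh)     # one shared K and V head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: [B, T, d_model]
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.h, self.dh).transpose(1, 2)    # [B, H, T, dh]
        k, v = self.kv_proj(x).chunk(2, dim=-1)                           # [B, T, dh] each
        k, v = k.unsqueeze(1), v.unsqueeze(1)                             # broadcast over heads
        # L2 metric logits against the single shared key head
        dist2 = (q.pow(2).sum(-1, keepdim=True) + k.pow(2).sum(-1).unsqueeze(-2)
                 - 2 * (q @ k.transpose(-2, -1)))
        attn = torch.softmax(-dist2 / self.dh ** 0.5, dim=-1)             # [B, H, T, T]
        y = (attn @ v).transpose(1, 2).reshape(B, T, -1)                  # [B, T, d_model]
        return self.out(y)
```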
### BlackHoleRoPE
• Q/K: rotation only (unit modulus), which preserves norms and avoids value blow-ups.
• V: bounded-energy amplification (between energy_min and energy_max), optionally modulated by a discrepancy signal.
• Parameters are synthesized from a small Fourier basis, reducing cache size and improving length generalization (rough sketch below).
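A very loose sketch of the two mechanisms: a unit-modulus rotation applied to Q/K and a bounded gain applied to V. Only `energy_min`/`energy_max` come from the description above; the frequencies, the gate shape, and everything else here are assumptions, and the Fourier-basis synthesis and discrepancy modulation are omitted.

```python
import torch

def blackhole_rope_sketch(q, k, v, base=10000.0, energy_min=0.5, energy_max=1.5):
    """Illustrative only: standard RoPE-style rotation for Q/K plus a bounded V gain."""
    B, H, T, D = q.shape
    half = D // 2
    pos = torch.arange(T, device=q.device, dtype=q.dtype)
    inv_freq = base ** (-torch.arange(0, half, device=q.device, dtype=q.dtype) / half)
    ang = pos[:, None] * inv_freq[None, :]                    # [T, D/2]
    cos, sin = ang.cos(), ang.sin()

    def rotate(x):
        x1, x2 = x[..., :half], x[..., half:]
        # unit-modulus rotation: preserves the norm of every (x1, x2) pair
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    # bounded-energy gate on V: position-dependent gain squashed into [energy_min, energy_max]
    gate = energy_min + (energy_max - energy_min) * torch.sigmoid(ang.mean(-1))  # [T]
    return rotate(q), rotate(k), v * gate[None, None, :, None]
```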
### Routing & gates
• TokenRouter: per-token weights over {LocalConv, MetricMH, ChannelMix, MetricMQA} (sketched below).
• Feature gates: per-head multiplicative scales in (0, 2), centered at 1.0.
• Optional router bias adds signed offsets before the softmax.
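A compact sketch of the token-wise routing, assuming the four branch outputs are already computed; the class and attribute names are illustrative:

```python
import torch
import torch.nn as nn

class TokenRouterSketch(nn.Module):
    """Per-token soft mixture over four branch outputs (illustrative, not the repo's class)."""
    def __init__(self, d_model, n_branches=4):
        super().__init__()
        self.logits = nn.Linear(d_model, n_branches)
        self.bias = nn.Parameter(torch.zeros(n_branches))    # optional signed offsets

    def forward(self, x, branch_outputs):
        # branch_outputs: list of 4 tensors [B, T, d_model] from
        # LocalConv, MetricMHAttention, ChannelMix, MetricMQA
        w = torch.softmax(self.logits(x) + self.bias, dim=-1)   # [B, T, 4]
        stacked = torch.stack(branch_outputs, dim=-1)            # [B, T, d_model, 4]
        return (stacked * w.unsqueeze(-2)).sum(-1)               # weighted blend per token
```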
### Triangle-Inequality regularizer
• A lightweight penalty on random triples discourages degenerate metric geometry (sketch below).
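One way such a penalty could be computed over random triples of hidden states; the sampling scheme and weighting here are assumptions:

```python
import torch

def triangle_inequality_penalty(dist_fn, x, n_triples=64):
    """Sketch of a TI regularizer: sample random triples (a, b, c) of token states and
    penalize d(a, c) > d(a, b) + d(b, c). `dist_fn` is whatever metric the head uses."""
    B, T, D = x.shape
    idx = torch.randint(0, T, (3, n_triples), device=x.device)
    a, b, c = x[:, idx[0]], x[:, idx[1]], x[:, idx[2]]           # [B, n_triples, D] each
    viol = dist_fn(a, c) - (dist_fn(a, b) + dist_fn(b, c))
    return torch.relu(viol).mean()                                # zero when the inequality holds

# Example with a plain L2 metric on the hidden states:
# penalty = triangle_inequality_penalty(lambda u, v: (u - v).norm(dim=-1), hidden_states)
```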
⸻
## Training recipe (reference)
• Device: CPU (AVX2/AVX-512 recommended).
• Precision: FP32.
• Optimizer: AdamW or Adam (β₁=0.9, β₂=0.95–0.999 work); cosine LR or linear warmup.
• Batch/seq: [batch, seq] = [2–4, 512–1024].
• Regularization: modest dropout in attention/value paths; optional TI penalty.
If you see NaN/Inf during sampling, ensure masks are additive 0/-inf, clamp logits when rows are fully masked, and set a pad_token_id in .generate().
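For example, a padding mask built in that additive convention; this is a sketch, and the modeling code may construct its masks differently:

```python
import torch

def additive_padding_mask(attention_mask, dtype=torch.float32):
    """0 for visible positions, a large negative value (standing in for -inf) for padded ones.
    Using a finite large negative keeps fully masked softmax rows from producing NaN."""
    neg = torch.finfo(dtype).min / 2
    mask = (1.0 - attention_mask.to(dtype)) * neg   # attention_mask: 1 = real token, 0 = pad
    return mask[:, None, None, :]                   # [B, 1, 1, T], broadcast over heads/queries
```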
⸻
## Evaluation notes
The model targets behavioral quality per FLOP rather than leaderboard chasing. On held-out long-context QA and small math checks, it shows:
• Robust token-to-token coherence at 512–1024.
• Stable generation on CPU with FP32.
• Competitive loss trends versus dot-product baselines trained under the same compute.
Please share issues/benchmarks via the repo so results can be tracked.
⸻
## How to fine-tune
```python
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

repo = "reaperdoesntknow/MoA-400M"
tok = AutoTokenizer.from_pretrained(repo)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # needed so batches can be padded
model = AutoModelForCausalLM.from_pretrained(repo)

ds = load_dataset("yzhuang/Agentic-Long-Context-Understanding-QA", split="train[:2%]")

def tok_fn(ex):
    return tok(
        ex["question"] + "\n" + ex["context"] + "\nAnswer:",
        truncation=True,
        max_length=1024,
    )

tds = ds.map(tok_fn, remove_columns=ds.column_names)
collator = DataCollatorForLanguageModeling(tok, mlm=False)  # pads and builds causal-LM labels

args = TrainingArguments(
    output_dir="./moa400m-finetune",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    learning_rate=5e-4,
    weight_decay=0.0,
    warmup_steps=100,
    logging_steps=10,
    save_steps=200,
    fp16=False,
    bf16=False,
)
trainer = Trainer(model=model, args=args, train_dataset=tds, data_collator=collator)
trainer.train()
```
⸻
## Known behaviors / tips
• Context > 1024: works, but CPU throughput drops; BlackHoleRoPE helps stability, not throughput.
• Sampling: always pass pad_token_id (often eos_token_id) to .generate(); avoid temperature > 1.2 on small models.
• KV cache: supported; for CPU you may prefer smaller beams and greedy/small-temperature sampling.
⸻
## Safety & responsibility
This is a research model. It was trained on public datasets and may produce incorrect or biased content. Do not rely on it for advice or sensitive decisions.
⸻
## Citation

```bibtex
@software{moa_metric_lm_400m,
  title  = {MoA-Metric-LM-400M: Distance-based attention with BlackHoleRoPE},
  author = {reaperdoesntknow},
  year   = {2025},
  url    = {https://huggingface.co/reaperdoesntknow/MoA-400M}
}
```
⸻
## Acknowledgements
Built with 🤗 Transformers and a metric-first rethinking of attention. BlackHoleRoPE draws inspiration from symplectic/rotational encodings and bounded-energy dynamics.