reaperdoesntknow committed
Commit 70c1469 · verified · 1 Parent(s): 18ac594

Update README.md

Files changed (1)
  1. README.md +176 -144
README.md CHANGED
@@ -2,204 +2,236 @@
  library_name: transformers
  license: apache-2.0
  datasets:
- - QingyiSi/Alpaca-CoT
  - WeMake/Intelligent-Content-Understanding
  language:
  - en
  pipeline_tag: text-generation
  ---
- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
 
  library_name: transformers
  license: apache-2.0
  datasets:
  - WeMake/Intelligent-Content-Understanding
+ - QingyiSi/Alpaca-CoT
+ - HuggingFaceH4/MATH-500
+ - zai-org/LongWriter-6k
  language:
  - en
  pipeline_tag: text-generation
  ---

+ # MoA-Metric-LM-150M (Convergent)
+
+ A compact-but-capable ≈150M-parameter causal LM that replaces dot-product attention with metric-native attention and augments sequence geometry with BlackHoleRoPE (a learnable, stable RoPE variant). It is designed to train and run on modest hardware (CPU-first friendly) while staying fully compatible with 🤗 Transformers.
+
+ ---
+ ## Why this model?
+
+ - Distance scores, not dot products. Heads score with L2, cosine, or diag-Mahalanobis distances. This gives direct control over geometry, often stabilizes training, and can be more sample-efficient.
+ - BlackHoleRoPE positional encoding.
+   - Q/K: pure unit-modulus rotation (unitary → numerically stable).
+   - V: bounded-energy gating (Penrose-inspired), optionally modulated by a discrepancy signal.
+   - Parameters synthesized from a tiny Fourier basis → extrapolable and cache-friendly, with low memory.
+ - MoA (Mixture-of-Architectures) block. A token-wise router softly blends four heads per block:
+   1. LocalConv (depthwise token-local conv)
+   2. MetricMHAttention (multi-head metric attention)
+   3. ChannelMix (MLP)
+   4. MetricMQA (multi-query, shared K/V)
+ - Triangle-Inequality (TI) regularizer. Keeps metric heads honest by penalizing violations over random triples.
+ - Runs on CPUs. Implemented to behave well in FP32 on AVX2/AVX-512 machines.
+
+ ## Model at a glance
+
+ | Property | Value |
+ |---|---|
+ | Parameters | ~150M (exact count depends on vocab; see config.json) |
+ | Layers | 12–24 depending on variant (MoA blocks) |
+ | Hidden size | ≥ 1024 in the 400M variant (head dim divisible by #heads) |
+ | Attention | Metric-native (L2 / cosine / diag-Mahalanobis), plus MetricMQA |
+ | Positional | BlackHoleRoPE per head (rope_global for MH-Attn, rope_mqa for MQA) |
+ | Router | Token-wise soft mixture across the four heads (+ optional bias gate) |
+ | FFN | HyperFFN = SwiGLU MLP + SepConv1d + low-rank path (router-mixed) |
+ | Context | Trained primarily at 512–1024 tokens; config allows up to 2048 |
+ | Precision | Training in FP32 (CPU-friendly); inference in FP32/BF16/FP16 |
+ | License | Apache-2.0 |
+
+ Note on context: training emphasized 512–1024; BlackHoleRoPE is extrapolable, but throughput and quality beyond training lengths depend on your hardware and data.
+
+ ## Intended use & limitations
+
+ Intended: compact assistants, long-context reading/QA, math-style step reasoning, and research on distance-based attention and geometric inductive biases.
+
+ Not intended: safety-critical use, heavy factual QA at web scale, or domains requiring guaranteed accuracy. Evaluate carefully before deployment.
+
+ ## Datasets
+
+ Bracketed pairs are the [batch, seq] shapes used for each dataset:
+
+ - WeMake/Intelligent-Content-Understanding: ~256k tokens, [8, 256], [4, 512]
+ - QingyiSi/Alpaca-CoT: ~128k tokens, [2, 1024], [1, 2048], [4, 512]
+ - HuggingFaceH4/MATH-500: ~256k tokens, [8, 256], [4, 512]
+ - zai-org/LongWriter-6k: ~128k tokens, [2, 1024], [1, 2048]
+
+ Training used modest token budgets (hundreds of thousands of tokens). Training logs showed healthy loss descent at both 512 and 1024 sequence lengths on CPU runs. Exact metrics will vary with tokenizer, preprocessing, and optimizer settings.
+ ## Installation
+
+ ```bash
+ pip install transformers accelerate sentencepiece
+ ```
+
+ ## Quick start
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ repo = "reaperdoesntknow/MoA-150M"
+
+ tok = AutoTokenizer.from_pretrained(repo)
+ model = AutoModelForCausalLM.from_pretrained(
+     repo, torch_dtype=torch.float32, device_map="cpu"
+ ).eval()
+
+ prompt = "Read and answer: If 3x + 2 = 17, what is x?\nReasoning:"
+ inputs = tok(prompt, return_tensors="pt")
+
+ with torch.no_grad():
+     out = model.generate(
+         **inputs,
+         max_length=256,
+         do_sample=True,
+         top_p=0.9,
+         temperature=0.8,
+         pad_token_id=tok.eos_token_id,
+     )
+
+ print(tok.decode(out[0], skip_special_tokens=True))
+ ```
+
+ ## Pipeline usage
+
+ ```python
+ from transformers import pipeline
+
+ repo = "reaperdoesntknow/MoA-400M"
+ pipe = pipeline("text-generation", model=repo, device_map="cpu")
+ print(
+     pipe(
+         "Question: Who wrote 'The Selfish Gene'?\nAnswer:",
+         max_length=128,
+         do_sample=False,
+     )[0]["generated_text"]
+ )
+ ```
+
+ ## Architecture details
+
+ ### Metric attention (MH)
+
+ - Scores:
+   - L2: -||q - k||² / sqrt(d)
+   - Cosine: normalized dot product, then scaled
+   - diag-Mahalanobis: per-head diagonal scale over dimensions
+ - Stability: logits are scaled by a learnable α; an optional radius-based pruning mask improves efficiency.
+ - Value path: a post-attention up/down projector (gated) for expressive value mixing.
+
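+ For intuition, here is a minimal, self-contained sketch of the three scoring rules as plain tensor ops. Shapes and function names are illustrative assumptions, not the model's internal API:
+
+ ```python
+ # Illustrative distance-based attention logits for a single head (not the repo's code).
+ import torch
+
+ def l2_scores(q, k):
+     # -||q - k||^2 / sqrt(d), computed pairwise over the sequence
+     d = q.size(-1)
+     dist2 = torch.cdist(q, k, p=2).pow(2)        # [batch, seq_q, seq_k]
+     return -dist2 / d ** 0.5
+
+ def cosine_scores(q, k, scale=10.0):
+     # normalized dot product, then scaled
+     qn = torch.nn.functional.normalize(q, dim=-1)
+     kn = torch.nn.functional.normalize(k, dim=-1)
+     return scale * qn @ kn.transpose(-1, -2)
+
+ def diag_mahalanobis_scores(q, k, w):
+     # w: positive per-dimension scales [d]; a diagonal Mahalanobis distance
+     d = q.size(-1)
+     dist2 = torch.cdist(q * w.sqrt(), k * w.sqrt(), p=2).pow(2)
+     return -dist2 / d ** 0.5
+
+ q, k = torch.randn(1, 8, 64), torch.randn(1, 8, 64)
+ attn = torch.softmax(l2_scores(q, k), dim=-1)    # each row sums to 1
+ ```
+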
+ ### Metric MQA (shared K/V)
+
+ - K and V are shared (a single projection) and broadcast; queries remain multi-head. Useful for throughput and memory.
+
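+ As a rough picture of the shared-K/V layout (shapes and names are assumptions for illustration, not the module's actual interface):
+
+ ```python
+ # Illustrative multi-query attention: one shared K/V head broadcast to all query heads.
+ import torch
+
+ B, H, T, Dh = 1, 8, 16, 32
+ q = torch.randn(B, H, T, Dh)     # multi-head queries
+ k = torch.randn(B, 1, T, Dh)     # single shared key head
+ v = torch.randn(B, 1, T, Dh)     # single shared value head
+
+ # L2 scores with broadcasting: k's head dimension (1) expands across the H query heads
+ dist2 = ((q.unsqueeze(3) - k.unsqueeze(2)) ** 2).sum(-1)   # [B, H, T, T]
+ attn = torch.softmax(-dist2 / Dh ** 0.5, dim=-1)
+ out = attn @ v                    # v broadcasts over the head dimension
+ print(out.shape)                  # torch.Size([1, 8, 16, 32])
+ ```
+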
+ ### BlackHoleRoPE
+
+ - Q/K rotation only (unit modulus) → preserves norms; avoids value blow-ups.
+ - V receives bounded-energy amplification (energy_min..energy_max) with optional discrepancy modulation.
+ - Parameters are synthesized from a small Fourier basis; this reduces cache size and improves length generalization.
+
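+ The two properties above can be pictured with a toy example: a paired-channel rotation (norm-preserving, as in standard RoPE) for Q/K and a bounded gain for V. This only illustrates the stated properties under assumed names; it is not the actual BlackHoleRoPE implementation:
+
+ ```python
+ # Toy illustration of "unit-modulus rotation for Q/K" and "bounded-energy gating for V".
+ import torch
+
+ def rotate_pairs(x, theta):
+     # rotate consecutive channel pairs by angle theta: vector norms are preserved
+     x1, x2 = x[..., 0::2], x[..., 1::2]
+     cos, sin = torch.cos(theta), torch.sin(theta)
+     out = torch.empty_like(x)
+     out[..., 0::2] = x1 * cos - x2 * sin
+     out[..., 1::2] = x1 * sin + x2 * cos
+     return out
+
+ def bounded_energy_gate(v, raw_gain, energy_min=0.5, energy_max=1.5):
+     # squash an unconstrained gain into [energy_min, energy_max] before scaling V
+     gain = energy_min + (energy_max - energy_min) * torch.sigmoid(raw_gain)
+     return v * gain
+
+ T, d = 16, 64
+ pos = torch.arange(T, dtype=torch.float32).unsqueeze(-1)
+ freqs = 1.0 / (10000 ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
+ theta = pos * freqs                                  # [T, d/2] angles
+
+ q = torch.randn(T, d)
+ q_rot = rotate_pairs(q, theta)
+ assert torch.allclose(q.norm(dim=-1), q_rot.norm(dim=-1), atol=1e-4)
+
+ v = torch.randn(T, d)
+ v_gated = bounded_energy_gate(v, raw_gain=torch.randn(T, 1))
+ ```
+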
+ ### Routing & gates
+
+ - TokenRouter: per-token weights over {LocalConv, MetricMH, ChannelMix, MetricMQA}.
+ - Feature gates: per-head multiplicative scales in (0, 2), centered at 1.0.
+ - An optional router bias adds signed offsets before the softmax.
+
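+ A minimal sketch of token-wise soft routing over four branch outputs (module and branch names here are stand-ins, not the repo's classes):
+
+ ```python
+ # Illustrative token-wise router over four branch outputs.
+ import torch
+ import torch.nn as nn
+
+ class TokenRouter(nn.Module):
+     def __init__(self, d_model, n_branches=4):
+         super().__init__()
+         self.proj = nn.Linear(d_model, n_branches)
+
+     def forward(self, x, branch_outputs):
+         # x: [B, T, d]; branch_outputs: list of n_branches tensors, each [B, T, d]
+         weights = torch.softmax(self.proj(x), dim=-1)      # [B, T, n_branches]
+         stacked = torch.stack(branch_outputs, dim=-1)      # [B, T, d, n_branches]
+         return (stacked * weights.unsqueeze(2)).sum(-1)    # [B, T, d]
+
+ B, T, d = 1, 8, 32
+ x = torch.randn(B, T, d)
+ branches = [torch.randn(B, T, d) for _ in range(4)]   # stand-ins for LocalConv, MetricMH, ChannelMix, MetricMQA
+ mixed = TokenRouter(d)(x, branches)
+ print(mixed.shape)                                     # torch.Size([1, 8, 32])
+ ```
+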
+ ### Triangle-Inequality regularizer
+
+ - A lightweight penalty on random triples discourages degenerate metric geometry.
+
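+ One way to picture the penalty: sample random triples of token representations and penalize violations of the triangle inequality under the learned distance. A sketch under assumed names, not the exact loss in the repo:
+
+ ```python
+ # Illustrative triangle-inequality penalty over random triples.
+ import torch
+
+ def ti_penalty(h, dist_fn, n_triples=64):
+     # h: [N, d] token representations; dist_fn: distance between row vectors
+     idx = torch.randint(0, h.size(0), (n_triples, 3))
+     a, b, c = h[idx[:, 0]], h[idx[:, 1]], h[idx[:, 2]]
+     # violation amount when d(a, c) > d(a, b) + d(b, c)
+     viol = dist_fn(a, c) - (dist_fn(a, b) + dist_fn(b, c))
+     return torch.relu(viol).mean()
+
+ l2 = lambda x, y: (x - y).norm(dim=-1)
+ h = torch.randn(128, 64)
+ print(ti_penalty(h, l2))   # ~0 for an exact metric such as L2
+ ```
+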
+ ## Training recipe (reference)
+
+ - Device: CPU (AVX2/AVX-512 recommended).
+ - Precision: FP32.
+ - Optimizer: AdamW or Adam (β₁ = 0.9, β₂ = 0.95–0.999 both work); cosine LR schedule or linear warmup.
+ - Batch/seq: [batch, seq] = [2–4, 512–1024].
+ - Regularization: modest dropout in the attention/value paths; optional TI penalty.
+
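+ As a reference point, the optimizer and schedule described above might be wired up as follows (the stand-in model and step count are placeholders, not the original training script):
+
+ ```python
+ # Reference optimizer/schedule matching the recipe above.
+ import torch
+ import torch.nn as nn
+ from transformers import get_cosine_schedule_with_warmup
+
+ model = nn.Linear(8, 8)   # stand-in; in practice, the loaded MoA causal LM
+ optimizer = torch.optim.AdamW(
+     model.parameters(), lr=5e-4, betas=(0.9, 0.95), weight_decay=0.0
+ )
+ scheduler = get_cosine_schedule_with_warmup(
+     optimizer, num_warmup_steps=100, num_training_steps=2000
+ )
+ # inside the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
+ ```
+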
+ If you see NaN/Inf during sampling, make sure attention masks are additive (0 for visible positions, -inf for masked ones), clamp logits for rows that are fully masked, and pass a pad_token_id to .generate().
+
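+ A small sketch of that additive-mask pattern (tensor shapes are illustrative):
+
+ ```python
+ # Additive attention mask: 0 for visible positions, -inf for masked ones.
+ import torch
+
+ scores = torch.randn(1, 4, 4)                            # [batch, seq_q, seq_k]
+ visible = torch.tensor([[True, True, False, False]])     # key positions that may be attended
+ mask = torch.zeros(1, 4).masked_fill(~visible, float("-inf"))
+ masked_scores = scores + mask                            # mask broadcasts over seq_q
+
+ # clamp rows that are fully masked so softmax cannot produce NaN
+ all_masked = torch.isinf(masked_scores).all(dim=-1, keepdim=True)
+ masked_scores = masked_scores.masked_fill(all_masked, 0.0)
+ probs = torch.softmax(masked_scores, dim=-1)
+ ```
+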
+ ## Evaluation notes
+
+ The model targets behavioral quality per FLOP rather than leaderboard chasing. On held-out long-context QA and small math checks it shows:
+
+ - Robust token-to-token coherence at 512–1024 tokens.
+ - Stable generation on CPU in FP32.
+ - Competitive loss trends versus dot-product baselines trained under the same compute budget.
+
+ Please share issues and benchmarks via the repo so results can be tracked.
+
+ ## How to fine-tune
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
+ from datasets import load_dataset
+
+ repo = "reaperdoesntknow/MoA-150M"
+ tok = AutoTokenizer.from_pretrained(repo)
+ model = AutoModelForCausalLM.from_pretrained(repo)
+
+ ds = load_dataset("yzhuang/Agentic-Long-Context-Understanding-QA", split="train[:2%]")
+
+ def tok_fn(ex):
+     # build a "question + context + Answer:" prompt; labels mirror input_ids for causal LM loss
+     x = tok(
+         ex["question"] + "\n" + ex["context"] + "\nAnswer:",
+         truncation=True,
+         max_length=512,
+     )
+     x["labels"] = x["input_ids"].copy()
+     return x
+
+ tds = ds.map(tok_fn, remove_columns=ds.column_names)
+
+ args = TrainingArguments(
+     output_dir="./moa400m-finetune",
+     per_device_train_batch_size=2,
+     gradient_accumulation_steps=1,
+     num_train_epochs=1,
+     learning_rate=5e-4,
+     weight_decay=0.0,
+     warmup_steps=100,
+     logging_steps=10,
+     save_steps=200,
+     fp16=False,
+     bf16=False,
+ )
+
+ trainer = Trainer(model=model, args=args, train_dataset=tds)
+ trainer.train()
+ ```
+
+ ## Known behaviors / tips
+
+ - Context > 1024: works, but CPU throughput drops; BlackHoleRoPE helps stability, not throughput.
+ - Sampling: always pass pad_token_id (often eos_token_id) to .generate(); avoid temperature > 1.2 on small models.
+ - KV cache: supported; on CPU, prefer small beams and greedy or low-temperature sampling.
+
+ ---
+
+ ## Safety & responsibility
+
+ This is a research model. It was trained on public datasets and may produce incorrect or biased content. Do not rely on it for advice or sensitive decisions.
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @software{moa_metric_lm_400m,
+   title  = {MoA-Metric-LM-400M: Distance-based attention with BlackHoleRoPE},
+   author = {reaperdoesntknow},
+   year   = {2025},
+   url    = {https://huggingface.co/reaperdoesntknow/MoA-400M}
+ }
+ ```
+
+ ---
+
+ ## Acknowledgements
+
+ Built with 🤗 Transformers and a metric-first rethinking of attention. BlackHoleRoPE draws inspiration from symplectic/rotational encodings and bounded-energy dynamics.