OzTianlu committed on
Commit 10bc31b · verified · 1 Parent(s): 856e3d4

Upload README.md

  1. README.md +393 -0
README.md ADDED
@@ -0,0 +1,393 @@
# Geilim-1B-Instruct (忌廉)

> **Deep Causal Internal Reasoning**
> No verbose CoT, no `<think>` tags, just concise answers powered by implicit reasoning.

---

## 💡 Introduction

Recent advances in reasoning models (DeepSeek R1, o1) have demonstrated impressive capabilities through Chain-of-Thought (CoT) reasoning. However, we observe several critical drawbacks:

**Problems with External CoT:**
1. **Verbosity Tax**: Models generate hundreds of tokens in `<think>` tags before answering, increasing latency and cost
2. **Autoregressive Dependency**: Models must "see" their reasoning to follow it, forcing sequential token generation
3. **Token Inefficiency**: Users pay for reasoning traces they often don't need; only the final answer matters
4. **Production Overhead**: Verbose outputs are impractical for real-time APIs and edge deployment

**Our Insight**: What if reasoning could happen *internally* in the model's hidden states, without generating verbose traces?

**Geilim-1B-Instruct** addresses these limitations through a hybrid architecture combining:
- **ASPP (Adjacency-Structured Parallel Propagation)**: Graph-based causal chains for structured reasoning
- **π-flow (Probability Flow Dynamics)**: Internal refinement in probability space without token generation
- **Hybrid Gating**: Learnable balance between structured and attention-based processing

The result: Deep reasoning capability with concise outputs - the best of both worlds.

---

## 🎯 Core Value Proposition

**Geilim-1B-Instruct is the anti-verbose reasoning model.**

| Model Type | Reasoning Approach | Output Style |
|------------|-------------------|--------------|
| **Baseline** (Llama-3.2-1B) | Limited reasoning | Direct but may lack depth |
| **CoT Models** (DeepSeek R1, o1) | External reasoning chains | Verbose `<think>` tags, long outputs |
| **Geilim-1B-Instruct** | **Internal reasoning** | **Concise answers, reasoning in hidden states** |

**Key Differentiator**: Geilim performs deep causal reasoning **internally** through ASPP+π-flow architecture, then outputs only the final answer. You get the reasoning quality without the verbosity tax.

---

## 🏗️ Architecture Overview

Geilim-1B-Instruct combines three key components for implicit reasoning:

### 1. **ASPP Operator** (Adjacency-Structured Parallel Propagation)
- **Union-Find graph structure**: Linear causal chain where each token only connects to its parent
- **Iterative message passing**: `h_i^(t+1) = φ(h_i^(t), h_parent[i])`
- **K-step evolution**: Adaptive 2-8 steps of causal propagation
- **Complexity**: O(n) - efficient linear-time reasoning

**Why it matters**: ASPP creates explicit causal relationships between tokens, allowing information to flow through a reasoning chain without generating output tokens.
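
The message-passing rule above can be illustrated with a toy, dependency-free sketch. It assumes a synchronous update (every token reads its parent's state from the previous step) and uses a simple average as a stand-in for the learned `φ`; the function names are hypothetical and are not taken from the Geilim codebase.

```python
from typing import Callable, List

Vector = List[float]


def linear_chain_parents(n: int) -> List[int]:
    """Parent-only adjacency: token i points to i-1; token 0 is its own parent."""
    return [max(i - 1, 0) for i in range(n)]


def mix(h: Vector, h_parent: Vector) -> Vector:
    """Toy stand-in for the learned phi: average a state with its parent's state."""
    return [0.5 * a + 0.5 * b for a, b in zip(h, h_parent)]


def aspp_propagate(
    states: List[Vector],
    parents: List[int],
    num_steps: int,
    phi: Callable[[Vector, Vector], Vector] = mix,
) -> List[Vector]:
    """Run num_steps rounds of parent-to-child message passing."""
    for _ in range(num_steps):
        # Synchronous update: each token does one phi call, so a step costs O(n).
        states = [phi(states[i], states[parents[i]]) for i in range(len(states))]
    return states


# Only token 0 carries a signal; after K steps it has reached token K at most.
initial = [[1.0]] + [[0.0] for _ in range(4)]
print(aspp_propagate(initial, linear_chain_parents(5), num_steps=2))
```

With two steps the signal from token 0 reaches token 2 and no further, which is why the number of propagation steps bounds how far information travels along the chain.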

### 2. **π-flow** (Probability Flow Dynamics)
- **Velocity field learning**: `h' = h + α * v(h)` where `v(h)` is a learned refinement
- **Multi-step refinement**: Iterates in probability space to converge on the correct answer
- **Gated application**: Model learns when to refine (complex questions) vs when to skip (simple questions)
- **Internal convergence**: Reasoning happens in hidden states, not in generated text

**Why it matters**: π-flow eliminates the need for external CoT by performing iterative refinement internally. The model "thinks" in its hidden states and outputs only the final result.
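
As a rough illustration of the update `h' = h + α * v(h)`, the sketch below applies it for a fixed number of steps. The velocity field here is a hand-written stand-in that points toward a known target; in the model `v(h)` is learned, so `toward` and `pi_flow_refine` are hypothetical names used only for this example.

```python
from typing import Callable, List

Vector = List[float]


def pi_flow_refine(
    h: Vector,
    velocity: Callable[[Vector], Vector],
    alpha: float = 0.5,
    steps: int = 2,
) -> Vector:
    """Apply h <- h + alpha * v(h) for a fixed number of refinement steps."""
    for _ in range(steps):
        v = velocity(h)
        h = [x + alpha * dx for x, dx in zip(h, v)]
    return h


def toward(target: Vector) -> Callable[[Vector], Vector]:
    """Toy velocity field (an assumption): always point at a fixed target."""
    def velocity(h: Vector) -> Vector:
        return [t - x for t, x in zip(target, h)]
    return velocity


# Two steps with alpha = 0.5 close three quarters of the gap to the target.
print(pi_flow_refine([0.0, 0.0], toward([1.0, 2.0]), alpha=0.5, steps=2))
```

No tokens are produced during the loop; only the state vector changes, which is the sense in which the refinement is internal.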

### 3. **Hybrid Gating Mechanism**
```
output = gate * ASPP(x) + (1-gate) * Attention(x)
```
- Combines structured causal reasoning (ASPP) with flexible attention
- Learnable balance between graph-based and sequence-based processing
- Applied to all 30 layers of the base model (Llama-3.2-1B)
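
The blend above can be sketched in a few lines. This example assumes the gate is a single scalar obtained by passing a learnable logit through a sigmoid; the README does not say whether the real gate is scalar, per-channel or per-token, so treat `hybrid_mix` and `gate_logit` as hypothetical.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Squash a logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def hybrid_mix(aspp_out: List[float], attn_out: List[float], gate_logit: float) -> List[float]:
    """output = gate * ASPP(x) + (1 - gate) * Attention(x), with gate = sigmoid(gate_logit)."""
    gate = sigmoid(gate_logit)
    return [gate * a + (1.0 - gate) * b for a, b in zip(aspp_out, attn_out)]


# A logit of 0 gives gate = 0.5 (equal blend); a large logit favours the ASPP branch.
print(hybrid_mix([2.0, 0.0], [0.0, 4.0], gate_logit=0.0))
print(hybrid_mix([2.0, 0.0], [0.0, 4.0], gate_logit=50.0))
```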

---

## 🧠 Why π-flow Eliminates Verbosity

### The Problem with Traditional CoT

**External Reasoning Models** (DeepSeek R1, o1-style):
```
User: What is 15 * 8?

Model: <think>
Let me break this down step by step:
1. First, I'll multiply 15 by 8
2. 15 * 8 = 15 * (10 - 2)
3. Using distributive property: 15*10 - 15*2
4. 150 - 30 = 120
Therefore, the answer is 120.
</think>

The answer is 120.
```
- **Output**: 250+ characters
- **Latency**: High (many tokens to generate)
- **Cost**: Expensive (charged per token)

### Geilim's Internal Reasoning

**Geilim-1B-Instruct** (ASPP+π-flow):
```
User: What is 15 * 8?

Model: 120
```
- **Output**: 3 characters
- **Latency**: Low (minimal generation)
- **Cost**: Minimal
- **Reasoning**: Happened internally through:
  1. ASPP causal chain propagating arithmetic relationships
  2. π-flow refining probability distribution across answer space
  3. Convergence to correct answer in hidden states

---

## 🔬 Technical Mechanism

### How π-flow Achieves Internal Reasoning

1. **Probability Space Operations**
   - Instead of generating tokens to explore answers, π-flow refines probability distributions directly
   - `v(h)`: Learned velocity field that corrects the model's initial judgment
   - Multi-step: `h^(0) → h^(1) → h^(2)` (2 refinement steps)

2. **Convergence Without Output**
   - Traditional models need to "see" their reasoning to follow it (autoregressive dependency)
   - π-flow breaks this: reasoning occurs in parallel across all positions simultaneously
   - The model converges internally before generating any output token

3. **Adaptive Complexity**
   - `pi_flow_use_gate=True`: Model learns when refinement is needed
   - Simple questions: Direct output (gate ≈ 0, skip refinement)
   - Complex questions: Internal multi-step refinement (gate ≈ 1, apply π-flow)
   - User always sees concise output regardless

4. **Synergy with ASPP**
   - ASPP provides causal structure (parent-child dependencies)
   - π-flow refines along these dependencies
   - **Result**: Structured reasoning (not just attention) + probabilistic convergence = deep causal understanding
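
Written as equations, the two stages described in this README might look as follows. This is one reading of the formulas quoted in the Architecture Overview, not the model's exact implementation: the time index on the parent state and the way the gate `g` scales the π-flow step are assumptions.

```latex
% ASPP: K steps of parent-only message passing (K between 2 and 8)
h_i^{(t+1)} = \phi\bigl(h_i^{(t)},\, h_{\mathrm{parent}(i)}^{(t)}\bigr), \qquad t = 0, \dots, K-1

% pi-flow: gated refinement, alpha = 0.5 (pi_flow_scale), 2 steps (pi_flow_steps)
h^{(s+1)} = h^{(s)} + g\,\alpha\, v\bigl(h^{(s)}\bigr), \qquad s = 0, 1, \quad g \in [0, 1]

% Limits of the gate
g \approx 0 \;\Rightarrow\; h^{(s+1)} \approx h^{(s)} \ (\text{refinement skipped}), \qquad
g \approx 1 \;\Rightarrow\; h^{(s+1)} \approx h^{(s)} + \alpha\, v\bigl(h^{(s)}\bigr)
```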

---

## 📊 Configuration

### Model Architecture
- **Base Model**: Llama-3.2-1B-Instruct (1.26B params)
- **Total Parameters**: ~1.4B (140M additional ASPP+π-flow params)
- **Hybrid Layers**: All 30 layers (universal reasoning capability)

### ASPP Settings
```python
aspp_hidden_dim: 512      # vs 2048 model hidden_size (reduce overfitting)
aspp_num_steps: 2-8       # learnable via sigmoid gating
aspp_dropout: 0.15
aspp_num_neighbors: 1     # Union-Find: parent-only connections
```

### π-flow Settings
```python
pi_flow: True             # Enable probability flow refinement
pi_flow_steps: 2          # 2-step refinement
pi_flow_scale: 0.5        # Moderate refinement strength
pi_flow_use_gate: True    # Adaptive gating
```
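
For readers who want these settings as a typed Python object, here is a minimal sketch. `GeilimSettings` is a hypothetical container written for this example, not the repository's `AsteriskConfig`, and splitting the `2-8` step range into `aspp_min_steps` and `aspp_max_steps` is an assumption about how that range would be stored.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GeilimSettings:
    """Hypothetical holder for the ASPP and pi-flow values listed in this README."""

    aspp_hidden_dim: int = 512
    aspp_min_steps: int = 2      # assumed lower bound of the 2-8 range
    aspp_max_steps: int = 8      # assumed upper bound of the 2-8 range
    aspp_dropout: float = 0.15
    aspp_num_neighbors: int = 1  # parent-only connections
    pi_flow: bool = True
    pi_flow_steps: int = 2
    pi_flow_scale: float = 0.5
    pi_flow_use_gate: bool = True

    def __post_init__(self) -> None:
        # Basic sanity checks so an invalid combination fails early.
        if not 1 <= self.aspp_min_steps <= self.aspp_max_steps:
            raise ValueError("aspp step range must satisfy 1 <= min <= max")
        if not 0.0 <= self.aspp_dropout < 1.0:
            raise ValueError("aspp_dropout must be in [0, 1)")
        if self.pi_flow_steps < 0:
            raise ValueError("pi_flow_steps must be non-negative")


settings = GeilimSettings()
print(asdict(settings))
```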

---

## 🚀 Quick Start

### Installation
```bash
pip install transformers torch
```

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model
model_path = "NoesisLab/Geilim-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate response
prompt = "A store has 120 apples. They sell 35 in the morning and 48 in the afternoon. How many are left?"
messages = [{"role": "user", "content": prompt}]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)  # Expected: "37" or "37 apples are left." (concise!)
```
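
### Advanced Usage
```python
# Reuses `model` and `tokenizer` from Basic Usage.
# For math problems requiring step-by-step (if needed)
# Note: Geilim prefers concise outputs, but can show work if prompted
prompt = "Explain how you would solve: What is 15 * 23?"

# For best results with implicit reasoning
generation_config = {
    "max_new_tokens": 128,      # Keep low to encourage conciseness
    "temperature": 0.7,         # Moderate sampling
    "do_sample": True,
    "top_p": 0.9,
    "repetition_penalty": 1.1,  # Prevent loops
}

# Apply the settings: build the chat prompt, then unpack the dict into generate()
messages = [{"role": "user", "content": prompt}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, **generation_config)
```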

---

## 🎓 Training Details

### Dataset
- **Mixed-Benchmark-Dataset** (composite reasoning benchmarks)
  - 25% GSM8K (math reasoning)
  - 30% HellaSwag (commonsense)
  - 20% ARC (science QA)
  - 10% OpenHermes (high-quality responses)
  - 15% Capybara (multi-turn conversations)
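
One way to realize a mixture like this is to draw each training example's source according to the listed proportions. The sketch below assumes weighted sampling with replacement; the README does not describe how the actual dataset was assembled, and `sample_sources` is a hypothetical helper.

```python
import random
from collections import Counter
from typing import List

# Mixture proportions as listed in this README.
MIX = {
    "GSM8K": 0.25,
    "HellaSwag": 0.30,
    "ARC": 0.20,
    "OpenHermes": 0.10,
    "Capybara": 0.15,
}


def sample_sources(n: int, seed: int = 0) -> List[str]:
    """Draw n source names with probability proportional to the mixture weights."""
    rng = random.Random(seed)  # local RNG keeps the draw reproducible
    return rng.choices(list(MIX), weights=list(MIX.values()), k=n)


# The weights should sum to 1; the empirical counts approach the target shares.
assert abs(sum(MIX.values()) - 1.0) < 1e-9
print(Counter(sample_sources(10_000)))
```

Sampling by weight keeps the target proportions even when the underlying datasets differ greatly in size.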

### Training Configuration
- **Framework**: TRL SFTTrainer with packing
- **Epochs**: 2
- **Batch Size**: Effective 8 (per_device=2, grad_accum=4)
- **Learning Rate**: 2e-4 with 10% warmup
- **Precision**: bfloat16 with gradient checkpointing
- **Optimizer**: AdamW (weight_decay=0.1, max_grad_norm=1.0)
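
The batch-size and warmup figures follow from simple arithmetic, sketched below. A single device is assumed (which is what makes 2 × 4 equal the stated effective batch of 8), and the total step count of 1000 is a made-up value for illustration; the README does not report the real number of optimizer steps.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def warmup_steps(total_steps: int, warmup_percent: int = 10) -> int:
    """Number of optimizer steps spent in learning-rate warmup, rounded up."""
    return math.ceil(total_steps * warmup_percent / 100)


print(effective_batch_size(per_device=2, grad_accum=4))  # 8 on one device
print(warmup_steps(total_steps=1000))                    # hypothetical total_steps
```

On more than one device the effective batch scales with the device count, so the stated value of 8 implies single-device training or a correspondingly smaller per-device setting.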

### Training Philosophy
Unlike CoT models trained on verbose reasoning chains, Geilim is trained on **answer-focused data** where:
- Correct answers are rewarded
- Reasoning quality is learned implicitly through ASPP+π-flow gradients
- The model learns to converge internally rather than generate external reasoning

---

## 📈 Evaluation

### Reasoning Quality Tests
Geilim is evaluated on:
1. **Math reasoning** (GSM8K-style arithmetic)
2. **Commonsense reasoning** (HellaSwag, PIQA)
3. **Logic puzzles** (multi-hop deduction)
4. **Reading comprehension** (information tracking)
5. **Causal reasoning** (cause-effect relationships)

### Key Metrics
- **Answer correctness** (primary goal)
- **Response conciseness** (< 150 chars = concise)
- **Reasoning traces** (should be absent from output, present in hidden states)
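
The three metrics can be checked per response with a few lines of code. This sketch is not the repository's `test_geilim.py`; the substring test for correctness and the `<think>` tag test for reasoning traces are simplifying assumptions, and `score` is a hypothetical helper.

```python
import re
from typing import Dict

THINK_TAG = re.compile(r"</?think>", re.IGNORECASE)


def is_concise(response: str, max_chars: int = 150) -> bool:
    """Concise means fewer than max_chars characters, per the threshold above."""
    return len(response.strip()) < max_chars


def has_reasoning_trace(response: str) -> bool:
    """Detect visible CoT markup in the generated text."""
    return bool(THINK_TAG.search(response))


def score(response: str, expected: str) -> Dict[str, bool]:
    """Naive per-response report; correctness is a plain substring match."""
    return {
        "correct": expected in response,
        "concise": is_concise(response),
        "trace_free": not has_reasoning_trace(response),
    }


print(score("120", expected="120"))
print(score("<think>15 * 10 - 15 * 2</think> The answer is 120.", expected="120"))
```

A substring match will accept answers that merely contain the expected digits, so a real evaluation would need stricter answer extraction.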

### Test Script
```bash
python test_geilim.py
```
Compares Geilim vs Llama-3.2-1B-Instruct baseline on 8 reasoning tasks.

### Run Benchmarks
```bash
python run_lmeval.py
```
Evaluates on: WinoGrande, ARC (easy/challenge), HellaSwag, PIQA.

---

## 🎯 Use Cases

### Ideal For:
- **Production APIs**: Low latency, low token cost
- **Real-time applications**: Minimal generation overhead
- **Cost-sensitive deployments**: Pay only for the answer, not the reasoning
- **User-facing chat**: Clean outputs without technical reasoning traces
- **Mobile/edge devices**: Smaller token budgets

### Not Ideal For:
- **Educational use cases**: When you want to show reasoning steps to users
- **Debugging/verification**: When explicit reasoning helps validate answers
- **Research**: When analyzing reasoning chains is the goal

---

## 🆚 Comparison Table

| Feature | Geilim-1B-Instruct | DeepSeek R1 | Llama-3.2-1B |
|---------|-----------|-------------|--------------|
| **Model Size** | 1.4B | 1.5B (R1-Distill-Qwen) | 1.26B |
| **Reasoning Type** | Internal (ASPP+π-flow) | External (CoT) | Limited |
| **Output Style** | Concise answers | Verbose `<think>` tags | Direct answers |
| **Latency** | Low | High (many tokens) | Low |
| **Cost per query** | Low | High | Low |
| **Reasoning depth** | Deep (hidden states) | Deep (explicit) | Shallow |
| **Token efficiency** | High | Low | Medium |

---

## 📚 Technical References

### Core Papers & Concepts
- **Union-Find Data Structure**: Parent-only connections for efficient causal propagation
- **Probability Flow ODEs**: Continuous refinement in probability space (inspired by diffusion models)
- **Hybrid Architectures**: Combining structured (graph) and unstructured (attention) reasoning

### Related Work
- DeepSeek R1: External reasoning chains
- o1 series: Long-form CoT reasoning
- SmolLM2: Efficient small language models
- Graph Neural Networks: Structured message passing

---

## 🔧 Development

### Custom Model Registration
- **Model type**: `asterisk` (registered with HuggingFace AutoModel)
- **Config class**: `AsteriskConfig` (extends LlamaConfig)
- **Model class**: `AsteriskForCausalLM` (extends LlamaForCausalLM)
- **Loading**: Requires `trust_remote_code=True`

### Training Your Own
```bash
# Install dependencies
pip install -r requirements.txt

# Train Geilim-1B-Instruct
python train_geilim.py
```

---

## 🌟 Key Takeaways

1. **No verbose CoT**: Geilim performs reasoning internally, outputs concisely
2. **ASPP+π-flow**: Causal graph structure + probability flow refinement
3. **Deep causal understanding**: Reasoning happens in hidden states, not generated text
4. **Production-ready**: Low latency, low cost, clean outputs
5. **Reasoning depth without verbosity**: Aims to match CoT models while skipping the visible trace

---

## 📝 Citation

If you use Geilim-1B-Instruct in your research or applications, please cite:

```bibtex
@misc{geilim2026,
  title={Geilim-1B-Instruct: Deep Causal Internal Reasoning via ASPP and Probability Flow},
  author={NoesisLab},
  year={2026},
  howpublished={HuggingFace Model Hub},
  url={https://huggingface.co/NoesisLab/Geilim-1B-Instruct}
}
```

---

## 🤝 Acknowledgments

- **Base Model**: Llama-3.2-1B-Instruct by Meta
- **Training Framework**: TRL by HuggingFace
- **Inspiration**: DeepSeek R1, for demonstrating the value of reasoning; Geilim pursues the same goal with concise outputs

---

## 📄 License

Llama 3.2 Community License

---

## 🔗 Links

- **Model Hub**: https://huggingface.co/NoesisLab/Geilim-1B-Instruct
- **Repository**: https://github.com/Liuxingyu1111111/Asterisk-R1

---

**Built with ❤️ for the era of efficient reasoning models.**

*Geilim (忌廉) - Cantonese for "cream" - smooth, concise, and rich in substance.*