---
library_name: transformers
model_name: Asterisk-Pi
base_model: NoesisLab/Asterisk
tags:
- aspp
- pi-flow
- hybrid-architecture
- graph-reasoning
- probability-flow
- sft
- trl
license: apache-2.0
language:
- en
---

# Asterisk-Pi: ASPP-Attention with π-Flow Refinement

**Asterisk-Pi** is an enhanced version of the Asterisk model that adds **π-flow (probability flow)** refinement to the hybrid ASPP-Attention architecture. Building on the SmolLM2-135M base, Asterisk-Pi implements per-layer iterative refinement inspired by probability flow ODEs from diffusion models, enabling multi-step reasoning through continuous state evolution.

## Model Description

- **Base Model**: [Asterisk](https://huggingface.co/NoesisLab/Asterisk) (SmolLM2-135M-Instruct with ASPP)
- **Architecture**: Hybrid ASPP-Attention + Per-Layer π-Flow (30 hybrid layers)
- **Parameters**: 173.7M (35.5M ASPP + 2.5M π-flow parameters)
- **Training**: Supervised Fine-Tuning on Mixed Benchmark Dataset
- **Framework**: Transformers 4.57.6, TRL 0.27.0

## Key Innovation: π-Flow Refinement

**π-Flow** (Probability Flow) adds iterative refinement to each hybrid layer, inspired by continuous-time probability flow ODEs:

```
h' = h + α * v(h)  [Euler discretization]
```

Where:
- `v(h)` is the velocity field computed by a dedicated ASPP operator
- `α` is a learnable per-token scaling factor (adaptive gating)
- Applied after ASPP-Attention fusion in each layer

This enables **60 total refinement steps** (30 layers × 2 steps each) throughout the model, allowing gradual convergence to more refined representations.

## Evaluation Results

Evaluated on LM-Evaluation-Harness:

| Task | Metric | Asterisk-Pi<br>(173.7M) | Asterisk<br>(171.2M) | SmolLM2-135M<br>(135.6M) | Gemma-3-270m-it<br>(270M) | Δ vs Asterisk | Δ vs SmolLM2 | Δ vs Gemma-3 |
|------|--------|-------------|-----------------|--------------|----------------|---------------|--------------|--------------|
| **ARC-Challenge** | acc_norm | **0.3038** | 0.2884 | 0.2773 | 0.2730 | +0.0154 | **+0.0265** | **+0.0308** |
| **ARC-Easy** | acc_norm | **0.5412** | **0.5450** | 0.4899 | 0.5059 | -0.0038 | **+0.0513** | **+0.0353** |
| **HellaSwag** | acc_norm | 0.4207 | **0.4430** | 0.4293 | 0.3937 | -0.0223 | -0.0086 | **+0.0270** |
| **PIQA** | acc_norm | 0.6703 | **0.6770** | 0.6632 | 0.6692 | -0.0067 | **+0.0071** | +0.0011 |
| **WinoGrande** | acc | **0.5391** | 0.5210 | 0.5154 | 0.5257 | +0.0181 | **+0.0237** | +0.0134 |

### Analysis

**π-Flow improvements over base Asterisk (absolute accuracy deltas, in points):**
- **ARC-Challenge** (+1.54 pts): More challenging reasoning benefits from iterative refinement
- **WinoGrande** (+1.81 pts): Multi-step resolution helps with pronoun disambiguation

**Improvements over the SmolLM2-135M base:**
- **ARC-Challenge** (+2.65 pts): Hybrid architecture + π-flow significantly improves complex reasoning
- **ARC-Easy** (+5.13 pts): Strong gains on elementary science questions
- **WinoGrande** (+2.37 pts): Better pronoun disambiguation through iterative refinement
- **PIQA** (+0.71 pts): Modest gains on physical commonsense

**Outperforming Gemma-3-270m-it (with 96M fewer parameters):**
- **ARC-Challenge** (+3.08 pts): Superior reasoning despite being ~35% smaller
- **ARC-Easy** (+3.53 pts): Significant advantage on elementary science
- **HellaSwag** (+2.70 pts): Much stronger commonsense reasoning
- **WinoGrande** (+1.34 pts): Better coreference resolution
- **PIQA** (+0.11 pts): Comparable physical reasoning

**Key insight**: Asterisk-Pi (173.7M params) consistently outperforms the much larger Gemma-3-270m-it (270M params), demonstrating that the hybrid ASPP-Attention architecture with π-flow refinement achieves superior parameter efficiency. The structured reasoning approach enables better performance per parameter, especially on complex multi-step reasoning tasks.

## Architecture

### Overview

![Asterisk-Pi Architecture](./Arch.png)

*Figure: Asterisk-Pi architecture showing the hybrid ASPP-Attention structure with π-flow refinement. Each of the 30 layers contains parallel ASPP and Attention branches, gated fusion, and iterative π-flow refinement using probability flow ODE.*

```
Input → [30 Hybrid Layers with π-Flow] → Output

Each Hybrid Layer:
1. ASPP-Attention Fusion (from base Asterisk)
2. π-Flow Refinement (NEW)
3. Feed-Forward Network
```

### 1. Hybrid ASPP-Attention Layer (Base Asterisk)

```python
class HybridASPPAttentionLayer:
    """
    Combines ASPP operator with standard attention

    Components:
    - ASPP operator: Local structured reasoning with Union-Find graph propagation
    - Standard attention: Global context
    - Gated fusion: Dynamic balancing
    """
```

#### ASPP Operator: Union-Find Graph Propagation

The ASPP operator uses a **Union-Find (Disjoint Set Union)** structure for efficient graph-based message passing. Unlike traditional attention's O(n²) complexity or skip-list's O(n log n), Union-Find achieves **O(n) complexity with nearly constant-time operations**.

**Graph Structure - Union-Find Parent Chain:**

```
Position:   0    1    2    3    4    5   ...   n-1
Parent:     0    0    1    2    3    4   ...   n-2
          (root)

- Position 0: points to itself (root of the tree)
- Position i (i > 0): points to position i-1 (its parent)
- Forms a linear chain structure for sequential token relationships
```

This creates a **directed acyclic graph (DAG)** where information flows from children to parents, naturally capturing left-to-right sequential dependencies in language modeling.

**Graph Propagation Aggregation:**

Each ASPP evolution step performs parent-based message passing:

```python
# Pseudocode for one ASPP propagation step
for position i in sequence:
    # 1. Find parent using Union-Find structure
    parent_idx = compute_parent_indices()[i]  # O(1) with path compression

    # 2. Gather parent features
    parent_features = hidden_states[parent_idx]

    # 3. Message aggregation: combine self + parent
    message_input = concat([hidden_states[i], parent_features])

    # 4. Update via learned transformation
    new_state = message_net(message_input)  # 2-layer MLP

    # 5. Scaled residual connection
    hidden_states[i] = hidden_states[i] + residual_scale * new_state
    hidden_states[i] = layer_norm(hidden_states[i])
```
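
In vectorized form, one propagation step reduces to a parent gather followed by a small MLP and a scaled residual update. A minimal PyTorch sketch of that step, with illustrative module names (`message_net`, `residual_scale`) taken from the pseudocode above rather than from the actual implementation:

```python
import torch
import torch.nn as nn

class UnionFindPropagationStep(nn.Module):
    """One Union-Find propagation step: gather parent features, mix them in with a small MLP."""

    def __init__(self, hidden_dim: int, dropout: float = 0.2):
        super().__init__()
        self.message_net = nn.Sequential(              # 2-layer MLP over [self || parent]
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.residual_scale = nn.Parameter(torch.tensor(0.1))
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, seq_len, hidden_dim]
        seq_len = h.size(1)
        parent = (torch.arange(seq_len, device=h.device) - 1).clamp(min=0)  # parent[0] = 0 (root)
        parent_features = h[:, parent, :]              # O(n) gather, no attention matrix
        update = self.message_net(torch.cat([h, parent_features], dim=-1))
        return self.norm(h + self.residual_scale * update)
```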

**Key properties of Union-Find propagation:**

1. **O(n) Complexity**: Each position performs exactly one parent lookup and one aggregation
   - No expensive attention computation (O(n²))
   - No multi-level skip connections (O(n log n))
   - Simple indexing operation: `parent_features = h[parent_indices]`

2. **Hierarchical Information Flow**: After K steps, position i can access information from positions [i-K, i]
   - K=1: immediate parent only
   - K=2: grandparent (2 positions back)
   - K=4 (default): great-great-grandparent (4 positions back)
   - Information propagates through the chain structure

3. **Learnable Aggregation**: The `message_net` MLP learns how to combine self and parent features
   - Input: `[self_features || parent_features]` (dimension 2·D)
   - Output: `D` dimensional update vector
   - Dropout regularization for robustness

4. **Path Compression Potential**: Can extend to dynamic parent reassignment
   - Current implementation: static `parent[i] = i-1` chain
   - Future extension: learn parent assignments based on semantic similarity
   - Enables adaptive graph structure during forward pass

**Union-Find vs. Other Graph Structures:**

| Structure | Complexity | Receptive Field | Connections per Node |
|-----------|------------|-----------------|----------------------|
| **Full Attention** | O(n²) | Global | n-1 (all positions) |
| **Skip-List** | O(n log n) | Multi-scale | O(log n) (multiple levels) |
| **Union-Find** | O(n) | Local chain | 1 (parent only) |
| **Dilated Conv** | O(n·k) | Sparse | k (fixed window) |

Union-Find achieves the **lowest complexity** while maintaining effective information propagation through iterative K-step evolution.

**Theoretical Foundation - Union-Find in Graph Algorithms:**

Union-Find is a classic data structure for disjoint set operations:
- **Find**: Determine which set an element belongs to (with path compression: O(α(n)) ≈ O(1))
- **Union**: Merge two sets into one
- **Applications**: Kruskal's MST algorithm, connected components, cycle detection

In Asterisk-Pi:
- Each token position is a node in the graph
- Parent pointers define the tree structure
- Message passing simulates "Find" operations (traversing to ancestors)
- Can extend to dynamic "Union" operations (merging related tokens)

**Multi-Step Propagation:**

With K=4 evolution steps, information flow becomes:
```
Step 1: Position i accesses parent i-1
Step 2: Position i now has information from i-2 (via i-1)
Step 3: Position i now has information from i-3 (propagated through chain)
Step 4: Position i now has information from i-4 (fully propagated)

Result: Each position has aggregated context from 4 previous positions
        through efficient O(n) operations
```

This multi-step propagation is crucial for:
- **Local context**: Recent tokens for coherence
- **Gradient flow**: Direct paths for backpropagation
- **Efficiency**: Linear cost instead of quadratic attention
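
A toy numerical check of this receptive-field claim (illustrative only: one-hot rows stand in for real features, and a fixed 0.5/0.5 average replaces the learned `message_net`):

```python
import torch

seq_len, num_steps = 8, 4
h = torch.eye(seq_len)                              # row i marks "information from position i"
parent = (torch.arange(seq_len) - 1).clamp(min=0)   # static Union-Find chain

for _ in range(num_steps):
    h = 0.5 * h + 0.5 * h[parent]                   # toy aggregation: average self and parent

print(h[5].nonzero().squeeze(-1))                   # tensor([1, 2, 3, 4, 5]) -> positions i-4..i
```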

**Fusion mechanism:**
```
aspp_out = ASPP(hidden_states)            # Union-Find graph propagation (O(n))
attn_out = Attention(hidden_states, mask, ...)  # Global attention (O(n²))
gate = sigmoid(linear([aspp_out || attn_out]))
fused = gate * aspp_out + (1 - gate) * attn_out

# Combines:
# - Local structured reasoning (ASPP via Union-Find)
# - Global contextual awareness (Attention)
```
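
The fusion itself is a single learned sigmoid gate over the concatenated branch outputs. A minimal sketch with illustrative layer names:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Blend ASPP and attention branch outputs with a learned per-token gate."""

    def __init__(self, hidden_dim: int, dropout: float = 0.1):
        super().__init__()
        self.gate_proj = nn.Linear(2 * hidden_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, aspp_out: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        fusion_input = self.dropout(torch.cat([aspp_out, attn_out], dim=-1))
        gate = torch.sigmoid(self.gate_proj(fusion_input))   # per-token, per-channel gate in (0, 1)
        return gate * aspp_out + (1 - gate) * attn_out
```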

### 2. π-Flow Refinement (Per-Layer)

```python
# Added to each hybrid layer
self.pi_flow_aspp = ASPPOperator(...)        # Velocity field network
self.pi_flow_scale = Parameter(0.2)          # Learnable flow strength
self.pi_flow_gate = MLP(hidden_size -> 1)    # Token-wise adaptive gating
```

**π-Flow forward pass:**
```
function π_flow_refinement(hidden_states):
    for step = 1 to π_flow_steps:
        # Compute velocity field using dedicated ASPP
        v = pi_flow_aspp(hidden_states)

        # Adaptive per-token gating
        gate = sigmoid(pi_flow_gate(hidden_states))  # [B, L, 1]
        alpha = pi_flow_scale * gate

        # Euler step in probability space
        hidden_states = hidden_states + alpha * v

    return hidden_states
```

**Key design choices:**
1. **Per-layer π-flow**: Each of 30 layers has independent π-flow parameters
2. **Learnable scale**: `pi_flow_scale` adapts flow strength during training
3. **Token-wise gating**: Different tokens get different flow magnitudes
4. **ASPP velocity**: Reuses ASPP architecture for computing v(h)
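
Putting these pieces together, a per-layer π-flow block can be sketched as below. This is a minimal version under the stated design, with the velocity network passed in as a generic module where the real model uses a dedicated ASPP operator:

```python
import torch
import torch.nn as nn

class PiFlowRefinement(nn.Module):
    """Per-layer π-flow: a few gated Euler steps h <- h + α · v(h)."""

    def __init__(self, hidden_dim: int, velocity_net: nn.Module, num_steps: int = 2):
        super().__init__()
        self.velocity_net = velocity_net                   # stands in for the dedicated ASPP operator
        self.num_steps = num_steps
        self.flow_scale = nn.Parameter(torch.tensor(0.2))  # learnable flow strength
        self.flow_gate = nn.Linear(hidden_dim, 1)          # token-wise adaptive gate

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_steps):
            v = self.velocity_net(h)                                    # velocity field v(h)
            alpha = self.flow_scale * torch.sigmoid(self.flow_gate(h))  # [B, L, 1]
            h = h + alpha * v                                           # Euler step
        return h
```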

### 3. Complete Layer Pseudocode

```
function HybridLayerWithPiFlow(hidden_states, attention_mask, ...):
    residual = hidden_states
    hidden_states = input_layernorm(hidden_states)

    # === Hybrid ASPP-Attention (Base Asterisk) ===
    aspp_output = aspp_operator(hidden_states)
    attn_output = self_attention(hidden_states, attention_mask, ...)

    # Gated fusion
    fusion_input = concat([aspp_output, attn_output])
    gate = sigmoid(linear(dropout(fusion_input)))
    fused_output = gate * aspp_output + (1 - gate) * attn_output

    # Residual connection
    hidden_states = residual + fused_output

    # === π-Flow Refinement (NEW) ===
    for step in [1..pi_flow_steps]:
        v = pi_flow_aspp(hidden_states)
        alpha = pi_flow_scale * sigmoid(pi_flow_gate(hidden_states))
        hidden_states = hidden_states + alpha * v

    # === MLP Block ===
    residual = hidden_states
    hidden_states = post_attention_layernorm(hidden_states)
    hidden_states = mlp(hidden_states)
    hidden_states = residual + hidden_states

    return hidden_states
```

## Parameter Breakdown

| Component | Parameters | Notes |
|-----------|------------|-------|
| **Base SmolLM2** | 135.6M | Embeddings, attention, MLP |
| **ASPP Operators** | 35.5M | 30 layers × ~1.2M each |
| **π-Flow ASPPs** | 2.3M | 30 layers × ~77k each |
| **π-Flow Gates** | 0.2M | 30 layers × ~7k each |
| **π-Flow Scales** | 30 | 30 learnable scalars |
| **Total** | **173.7M** | +28% vs base SmolLM2 |

π-Flow adds only **1.4% more parameters** (2.5M) compared to base Asterisk (171.2M) while providing 60 total refinement steps.
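
These counts can be sanity-checked directly on the checkpoint. The snippet below assumes the new parameters carry a `pi_flow` prefix in their names, which may differ in the actual implementation:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("NoesisLab/Asterisk-Pi", trust_remote_code=True)
total = sum(p.numel() for p in model.parameters())
pi_flow = sum(p.numel() for n, p in model.named_parameters() if "pi_flow" in n)
print(f"total: {total / 1e6:.1f}M | π-flow: {pi_flow / 1e6:.2f}M ({100 * pi_flow / total:.2f}%)")
```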

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Asterisk-Pi",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Asterisk-Pi")

# Generate text
messages = [{"role": "user", "content": "Explain the waterfall model in software engineering."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Dataset

Mixed benchmark dataset for testing true capabilities:

| Dataset | Ratio | Purpose |
|---------|-------|---------|
| **GSM8K** | 25% | Math reasoning benchmark |
| **HellaSwag** | 30% | Commonsense reasoning benchmark |
| **ARC** | 20% | Science QA (Easy + Challenge) |
| **OpenHermes** | 10% | High-quality long-form responses |
| **Capybara** | 15% | Multi-turn conversations |

Total: ~10,148 training samples
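
A mixture with these ratios can be assembled with the `datasets` library. The sketch below is illustrative only: file names are placeholders, and the per-source formatting into a common chat/text schema is not shown:

```python
from datasets import interleave_datasets, load_dataset

# Hypothetical pre-formatted JSONL exports of each source; names are placeholders.
ratios = {"gsm8k": 0.25, "hellaswag": 0.30, "arc": 0.20, "openhermes": 0.10, "capybara": 0.15}
sources = [load_dataset("json", data_files=f"{name}.jsonl", split="train") for name in ratios]
mixed = interleave_datasets(sources, probabilities=list(ratios.values()), seed=42)
```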

### Training Configuration

- **Starting Point**: Asterisk checkpoint (base ASPP-Attention model)
- **Optimizer**: AdamW (lr=5e-4, weight_decay=0.1)
- **Batch Size**: 2 per device, gradient accumulation=4 (effective batch=8)
- **Epochs**: 2
- **Scheduler**: Linear warmup (10% of steps)
- **Mixed Precision**: bfloat16
- **Gradient Checkpointing**: Enabled
- **Max Grad Norm**: 1.0
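
In TRL terms these settings map roughly onto the following `SFTConfig` (a sketch, not the exact training script; argument names follow the standard `transformers`/TRL training arguments):

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="asterisk-pi-sft",
    learning_rate=5e-4,
    weight_decay=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size 8
    num_train_epochs=2,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    bf16=True,
    gradient_checkpointing=True,
    max_grad_norm=1.0,
)
```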

### π-Flow Configuration

```python
pi_flow = True
pi_flow_steps = 2           # 2 refinement steps per layer
pi_flow_scale = 1.0         # Initial flow strength
pi_flow_use_gate = True     # Token-wise adaptive gating
```

### ASPP Configuration (Inherited from Base)

```python
aspp_hidden_dim = 256       # Internal dimension (vs 576 model hidden_size)
aspp_num_steps = 4          # Evolution steps for ASPP
aspp_dropout = 0.2          # Regularization
hybrid_layer_indices = None # All 30 layers
```

## Model Creation from Base Asterisk

```python
from AsteriskForCausalLM import AsteriskConfig, AsteriskForCausalLM
from safetensors.torch import load_file
import torch

# Load the Asterisk config and inject π-flow parameters
config = AsteriskConfig.from_pretrained("path/to/Asterisk", trust_remote_code=True)

# Add π-flow configuration
config.pi_flow = True
config.pi_flow_steps = 2
config.pi_flow_scale = 1.0
config.pi_flow_use_gate = True

# Create model with π-flow
model = AsteriskForCausalLM(config)

# Load pretrained Asterisk weights (strict=False ignores new π-flow params)
state_dict = load_file("path/to/Asterisk/model.safetensors")
missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)

# π-flow parameters are randomly initialized
print(f"New π-flow parameters: {len(missing_keys)}")

# Move to device
model = model.to(dtype=torch.bfloat16, device="cuda")
```

## Theoretical Background

### π-Flow: Probability Flow ODE

Inspired by diffusion model score-based formulations:

```
dx/dt = v(x, t)  [Continuous probability flow]
```

Discretized with Euler method:
```
x_{t+1} = x_t + Δt * v(x_t)
```

In Asterisk-Pi:
- `x_t` = hidden states at layer output
- `v(x_t)` = velocity field from dedicated ASPP
- `Δt` = learnable `pi_flow_scale * gate(x_t)`
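
As a concrete illustration of the discretization, here is the Euler scheme on a toy velocity field (nothing model-specific):

```python
import torch

def velocity(x: torch.Tensor) -> torch.Tensor:
    return -x                       # toy field: flow toward the origin

x, dt = torch.tensor([1.0]), 0.2
for _ in range(2):                  # two Euler steps, mirroring pi_flow_steps = 2
    x = x + dt * velocity(x)
print(x)                            # tensor([0.6400]); exact solution e^{-0.4} ≈ 0.6703
```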

### Multi-Scale Refinement

- **Layer-level**: 30 hybrid layers with ASPP-Attention fusion
- **π-Flow level**: 2 steps per layer = 60 total refinement operations
- **ASPP-level**: 4 evolution steps within each ASPP = 240 micro-updates

This creates a **hierarchical refinement cascade** enabling gradual convergence to high-quality representations.

### Why π-Flow Helps

1. **Iterative refinement**: Multiple passes allow correcting errors
2. **Adaptive flow**: Token-wise gating focuses computation where needed
3. **Gradient flow**: More direct paths for gradient propagation
4. **Expressiveness**: Increases model capacity with minimal parameters

## Implementation Details

### Return Type Handling

Critical for Transformers compatibility:

```python
# HybridASPPAttentionLayer.forward() returns tensor only
def forward(self, hidden_states, ...) -> torch.Tensor:
    # ... ASPP + Attention + π-flow ...
    return hidden_states  # ✅ Tensor, not tuple

# This matches LlamaDecoderLayer API: -> torch.Tensor
```

### Gradient Checkpointing Compatibility

π-Flow is fully compatible with gradient checkpointing:
- All operations are standard PyTorch ops
- No custom CUDA kernels
- Automatic differentiation through flow steps

### Weight Initialization

- **ASPP parameters**: Transferred from base Asterisk
- **π-Flow ASPP**: Randomly initialized (Xavier uniform)
- **π-Flow scale**: Initialized to 0.2 (conservative)
- **π-Flow gate**: Initialized to output ~0.5 (balanced)
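
A sketch of how such an initialization could be applied to the newly added parameters after loading the base weights (the `pi_flow` name pattern is an assumption; adapt it to the actual module names):

```python
import torch.nn as nn

def init_pi_flow_parameters(model: nn.Module) -> None:
    """Xavier-init new π-flow weights and keep the flow scale conservative."""
    for name, param in model.named_parameters():
        if "pi_flow" not in name:
            continue                              # leave transferred Asterisk/ASPP weights untouched
        if name.endswith("pi_flow_scale"):
            nn.init.constant_(param, 0.2)         # conservative initial flow strength
        elif param.dim() >= 2:
            nn.init.xavier_uniform_(param)        # weight matrices
        else:
            nn.init.zeros_(param)                 # zero biases -> gate starts near sigmoid(0) = 0.5
```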

## Files in Checkpoint

```
Asterisk-Pi/
├── AsteriskForCausalLM.py    # Model implementation (with π-flow)
├── config.json                # Model configuration
├── model.safetensors          # Model weights
├── tokenizer.json             # Tokenizer
├── generation_config.json     # Generation settings
└── README.md                  # This file
```

## Differences from Base Asterisk

| Feature | Asterisk | Asterisk-Pi |
|---------|----------|-------------|
| **ASPP-Attention** | ✅ | ✅ |
| **π-Flow Refinement** | ❌ | ✅ (per-layer) |
| **Parameters** | 171.2M | 173.7M (+1.4%) |
| **Refinement Steps** | 30 (layers) | 60 (30 layers × 2) |
| **Training Dataset** | Capybara | Mixed Benchmarks |
| **Complexity** | Medium | High |

## Known Issues & Solutions

### 1. Return Type Errors

**Issue**: `AttributeError: 'tuple' object has no attribute 'dtype'`

**Solution**: `HybridASPPAttentionLayer.forward()` must return `torch.Tensor` only, not tuple. This matches the `LlamaDecoderLayer` API in transformers 4.57.6.

### 2. π-Flow in All Layers vs Final Layer

**Initial approach**: π-flow only in final layer (limited expressiveness)

**Current approach**: π-flow in all 30 hybrid layers for maximum refinement capability.

### 3. Training Stability

π-Flow can cause instability with high learning rates. Use:
- A carefully chosen learning rate (this run used 5e-4; the base Asterisk run used 2e-5)
- Gradient clipping (max_norm=1.0)
- Conservative initial flow scale (0.2-1.0)

## Dependencies

```bash
pip install "torch>=2.0.0"
pip install "transformers>=4.40.0"
pip install "trl>=0.8.0"
pip install "datasets>=2.14.0"
pip install "accelerate>=0.25.0"
pip install bitsandbytes
pip install safetensors
```

## Citations

If you use this model, please cite:

```bibtex
@misc{asteriskpi2026,
  title={Asterisk-Pi: Probability Flow Refinement for Hybrid ASPP-Attention Models},
  author={NoesisLab},
  year={2026},
  publisher={Huggingface},
  url={https://huggingface.co/NoesisLab/Asterisk-Pi}
}
```

```bibtex
@misc{asterisk2026,
  title={Asterisk: Hybrid ASPP-Attention Architecture for Enhanced Language Modeling},
  author={NoesisLab},
  year={2026},
  publisher={Huggingface},
  url={https://huggingface.co/NoesisLab/Asterisk}
}
```

```bibtex
@misc{vonwerra2022trl,
  title={{TRL: Transformer Reinforcement Learning}},
  author={Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
  year={2020},
  journal={GitHub repository},
  publisher={GitHub},
  howpublished={\url{https://github.com/huggingface/trl}}
}
```

```bibtex
@article{allal2024SmolLM2,
  title={SmolLM2 - with great data, comes great performance},
  author={Allal, Loubna Ben and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro},
  year={2024}
}
```

## Related Work

- **Diffusion Models**: π-flow inspired by probability flow ODEs in score-based diffusion
- **Neural ODEs**: Continuous-depth models with adaptive computation
- **Iterative Refinement**: Multi-pass decoding in sequence models

## Future Directions

1. **Adaptive π-flow steps**: Learn number of refinement steps per layer
2. **Higher-order ODE solvers**: Replace Euler with RK4 or adaptive schemes
3. **Stochastic π-flow**: Add noise injection for exploration
4. **Cross-layer π-flow**: Allow information flow between distant layers

## License

This model inherits the Apache 2.0 license from SmolLM2-135M-Instruct.

## Framework Versions

- **TRL**: 0.27.0
- **Transformers**: 4.57.6
- **PyTorch**: 2.8.0+cu128
- **Datasets**: 4.5.0
- **Tokenizers**: 0.22.2

## Acknowledgments

Built on top of:
- [Asterisk](https://huggingface.co/NoesisLab/Asterisk) - Base ASPP-Attention architecture
- [SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct) - Foundation model
- [TRL](https://github.com/huggingface/trl) - Training framework

Special thanks to the diffusion model community for probability flow ODE insights.