|
|
--- |
|
|
library_name: transformers |
|
|
model_name: Asterisk-Pi |
|
|
base_model: NoesisLab/Asterisk |
|
|
tags: |
|
|
- aspp |
|
|
- pi-flow |
|
|
- hybrid-architecture |
|
|
- graph-reasoning |
|
|
- probability-flow |
|
|
- sft |
|
|
- trl |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
--- |
|
|
|
|
|
# Asterisk-Pi: ASPP-Attention with π-Flow Refinement |
|
|
|
|
|
**Asterisk-Pi** is an enhanced version of the Asterisk model that adds **π-flow (probability flow)** refinement to the hybrid ASPP-Attention architecture. Building on the SmolLM2-135M base, Asterisk-Pi implements per-layer iterative refinement inspired by probability flow ODEs from diffusion models, enabling multi-step reasoning through continuous state evolution. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Base Model**: [Asterisk](https://huggingface.co/NoesisLab/Asterisk) (SmolLM2-135M-Instruct with ASPP) |
|
|
- **Architecture**: Hybrid ASPP-Attention + Per-Layer π-Flow (30 hybrid layers) |
|
|
- **Parameters**: 173.7M (35.5M ASPP + 2.5M π-flow parameters on top of the 135.6M base)
|
|
- **Training**: Supervised Fine-Tuning on Mixed Benchmark Dataset |
|
|
- **Framework**: Transformers 4.57.6, TRL 0.27.0 |
|
|
|
|
|
## Key Innovation: π-Flow Refinement |
|
|
|
|
|
**π-Flow** (Probability Flow) adds iterative refinement to each hybrid layer, inspired by continuous-time probability flow ODEs: |
|
|
|
|
|
``` |
|
|
h' = h + α * v(h) [Euler discretization] |
|
|
``` |
|
|
|
|
|
Where: |
|
|
- `v(h)` is the velocity field computed by a dedicated ASPP operator |
|
|
- `α` is a learnable per-token scaling factor (adaptive gating) |
|
|
- Applied after ASPP-Attention fusion in each layer |
|
|
|
|
|
This enables **60 total refinement steps** (30 layers × 2 steps each) throughout the model, allowing gradual convergence to more refined representations. |
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
Evaluated with the LM Evaluation Harness:
|
|
|
|
|
| Task | Metric | Asterisk-Pi<br>(173.7M) | Asterisk<br>(171.2M) | SmolLM2-135M<br>(135.6M) | Gemma-3-270m-it<br>(270M) | Δ vs Asterisk | Δ vs SmolLM2 | Δ vs Gemma-3 | |
|
|
|------|--------|-------------|-----------------|--------------|----------------|---------------|--------------|--------------| |
|
|
| **ARC-Challenge** | acc_norm | **0.3038** | 0.2884 | 0.2773 | 0.2730 | +0.0154 | **+0.0265** | **+0.0308** | |
|
|
| **ARC-Easy** | acc_norm | 0.5412 | **0.5450** | 0.4899 | 0.5059 | -0.0038 | **+0.0513** | **+0.0353** |
|
|
| **HellaSwag** | acc_norm | 0.4207 | **0.4430** | 0.4293 | 0.3937 | -0.0223 | -0.0086 | **+0.0270** | |
|
|
| **PIQA** | acc_norm | 0.6703 | **0.6770** | 0.6632 | 0.6692 | -0.0067 | **+0.0071** | +0.0011 | |
|
|
| **WinoGrande** | acc | **0.5391** | 0.5210 | 0.5154 | 0.5257 | +0.0181 | **+0.0237** | +0.0134 | |
|
|
|
|
|
### Analysis |
|
|
|
|
|
**π-Flow improvements over base Asterisk:** |
|
|
- **ARC-Challenge** (+1.54%): More challenging reasoning benefits from iterative refinement |
|
|
- **WinoGrande** (+1.81%): Multi-step resolution helps with pronoun disambiguation |
|
|
|
|
|
**Improvements over SmolLM2-135M base:** |
|
|
- **ARC-Challenge** (+2.65%): Hybrid architecture + π-flow significantly improves complex reasoning |
|
|
- **ARC-Easy** (+5.13%): Strong gains on elementary science questions |
|
|
- **WinoGrande** (+2.37%): Better pronoun disambiguation through iterative refinement |
|
|
- **PIQA** (+0.71%): Modest gains on physical commonsense |
|
|
|
|
|
**Outperforming Gemma-3-270m-it (with 96M fewer parameters):** |
|
|
- **ARC-Challenge** (+3.08%): Superior reasoning despite being 35% smaller |
|
|
- **ARC-Easy** (+3.53%): Significant advantage on elementary science |
|
|
- **HellaSwag** (+2.70%): Much stronger commonsense reasoning |
|
|
- **WinoGrande** (+1.34%): Better coreference resolution |
|
|
- **PIQA** (+0.11%): Comparable physical reasoning |
|
|
|
|
|
**Key insight**: Asterisk-Pi (173.7M params) consistently outperforms the much larger Gemma-3-270m-it (270M params), demonstrating that the hybrid ASPP-Attention architecture with π-flow refinement achieves superior parameter efficiency. The structured reasoning approach enables better performance per parameter, especially on complex multi-step reasoning tasks. |
|
|
|
|
|
## Architecture |
|
|
|
|
|
### Overview |
|
|
|
|
|
 |
|
|
|
|
|
*Figure: Asterisk-Pi architecture showing the hybrid ASPP-Attention structure with π-flow refinement. Each of the 30 layers contains parallel ASPP and Attention branches, gated fusion, and iterative π-flow refinement using probability flow ODE.* |
|
|
|
|
|
``` |
|
|
Input → [30 Hybrid Layers with π-Flow] → Output |
|
|
|
|
|
Each Hybrid Layer: |
|
|
1. ASPP-Attention Fusion (from base Asterisk) |
|
|
2. π-Flow Refinement (NEW) |
|
|
3. Feed-Forward Network |
|
|
``` |
|
|
|
|
|
### 1. Hybrid ASPP-Attention Layer (Base Asterisk) |
|
|
|
|
|
```python |
|
|
class HybridASPPAttentionLayer: |
|
|
""" |
|
|
Combines ASPP operator with standard attention |
|
|
|
|
|
Components: |
|
|
- ASPP operator: Local structured reasoning with Union-Find graph propagation |
|
|
- Standard attention: Global context |
|
|
- Gated fusion: Dynamic balancing |
|
|
""" |
|
|
``` |
|
|
|
|
|
#### ASPP Operator: Union-Find Graph Propagation |
|
|
|
|
|
The ASPP operator uses a **Union-Find (Disjoint Set Union)** structure for efficient graph-based message passing. Unlike traditional attention's O(n²) complexity or skip-list's O(n log n), Union-Find achieves **O(n) complexity with nearly constant-time operations**. |
|
|
|
|
|
**Graph Structure - Union-Find Parent Chain:** |
|
|
|
|
|
``` |
|
|
Position:  [0] [1] [2] [3] [4] [5] ... [n-1]
Parent:     0   0   1   2   3   4  ...  n-2
          (root)

- Position 0: points to itself (root of the tree)
- Position i (i > 0): points to position i-1 (its parent)
- Forms a linear chain structure for sequential token relationships
|
|
``` |
|
|
|
|
|
This creates a **directed acyclic graph (DAG)** in which each position reads from its parent, so information flows from earlier positions to later ones, naturally capturing left-to-right sequential dependencies in language modeling.
|
|
|
|
|
**Graph Propagation Aggregation:** |
|
|
|
|
|
Each ASPP evolution step performs parent-based message passing: |
|
|
|
|
|
```python |
|
|
# One ASPP propagation step (per-position view; vectorized in practice)
for i in range(seq_len):
    # 1. Find parent using the Union-Find structure (O(1) lookup)
    parent_idx = parent_indices[i]

    # 2. Gather parent features
    parent_features = hidden_states[parent_idx]

    # 3. Message aggregation: combine self + parent
    message_input = torch.cat([hidden_states[i], parent_features])

    # 4. Update via learned transformation
    new_state = message_net(message_input)  # 2-layer MLP

    # 5. Scaled residual connection, then normalization
    hidden_states[i] = hidden_states[i] + residual_scale * new_state
    hidden_states[i] = layer_norm(hidden_states[i])
|
|
``` |
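In practice the per-position loop vectorizes into a single gather per step. A minimal self-contained sketch, with illustrative module sizes and names (not the exact implementation):

```python
import torch
import torch.nn as nn

def aspp_propagation_step(h, message_net, norm, residual_scale=0.1):
    """One vectorized ASPP propagation step over a [batch, seq, dim] tensor."""
    seq_len = h.size(1)
    # Union-Find parent chain: parent[0] = 0, parent[i] = i - 1
    parent_idx = (torch.arange(seq_len, device=h.device) - 1).clamp(min=0)
    parent_features = h[:, parent_idx, :]                    # one gather -> O(n)
    message_input = torch.cat([h, parent_features], dim=-1)  # [B, L, 2D]
    new_state = message_net(message_input)                   # learned aggregation
    return norm(h + residual_scale * new_state)              # scaled residual + norm

# Toy usage (D = 576 matches the model hidden size; the other sizes are arbitrary)
D = 576
message_net = nn.Sequential(nn.Linear(2 * D, D), nn.GELU(), nn.Linear(D, D))
norm = nn.LayerNorm(D)
h = torch.randn(2, 16, D)
for _ in range(4):  # K = 4 evolution steps
    h = aspp_propagation_step(h, message_net, norm)
```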
|
|
|
|
|
**Key properties of Union-Find propagation:** |
|
|
|
|
|
1. **O(n) Complexity**: Each position performs exactly one parent lookup and one aggregation |
|
|
- No expensive attention computation (O(n²)) |
|
|
- No multi-level skip connections (O(n log n)) |
|
|
- Simple indexing operation: `parent_features = h[parent_indices]` |
|
|
|
|
|
2. **Hierarchical Information Flow**: After K steps, position i can access information from positions [i-K, i] |
|
|
- K=1: immediate parent only |
|
|
- K=2: grandparent (2 positions back) |
|
|
- K=4 (default): great-great-grandparent (4 positions back) |
|
|
- Information propagates through the chain structure |
|
|
|
|
|
3. **Learnable Aggregation**: The `message_net` MLP learns how to combine self and parent features |
|
|
- Input: `[self_features || parent_features]` (2·D dimensions, where D is the hidden size)
|
|
- Output: `D` dimensional update vector |
|
|
- Dropout regularization for robustness |
|
|
|
|
|
4. **Path Compression Potential**: Can extend to dynamic parent reassignment |
|
|
- Current implementation: static `parent[i] = i-1` chain |
|
|
- Future extension: learn parent assignments based on semantic similarity |
|
|
- Enables adaptive graph structure during forward pass |
|
|
|
|
|
**Union-Find vs. Other Graph Structures:** |
|
|
|
|
|
| Structure | Complexity | Receptive Field | Connections per Node | |
|
|
|-----------|------------|-----------------|----------------------| |
|
|
| **Full Attention** | O(n²) | Global | n-1 (all positions) | |
|
|
| **Skip-List** | O(n log n) | Multi-scale | O(log n) (multiple levels) | |
|
|
| **Union-Find** | O(n) | Local chain | 1 (parent only) | |
|
|
| **Dilated Conv** | O(n·k) | Sparse | k (fixed window) | |
|
|
|
|
|
Union-Find achieves the **lowest complexity** while maintaining effective information propagation through iterative K-step evolution. |
|
|
|
|
|
**Theoretical Foundation - Union-Find in Graph Algorithms:** |
|
|
|
|
|
Union-Find is a classic data structure for disjoint-set operations (a reference sketch follows this list):
|
|
- **Find**: Determine which set an element belongs to (with path compression: O(α(n)) ≈ O(1)) |
|
|
- **Union**: Merge two sets into one |
|
|
- **Applications**: Kruskal's MST algorithm, connected components, cycle detection |
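For reference, a textbook Union-Find with path compression; this is background for the terminology, not code from the model:

```python
class UnionFind:
    """Disjoint-set with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Path compression (halving): point nodes closer to the root as we walk up
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra           # attach the smaller tree under the larger
        self.size[ra] += self.size[rb]
```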
|
|
|
|
|
In Asterisk-Pi: |
|
|
- Each token position is a node in the graph |
|
|
- Parent pointers define the tree structure |
|
|
- Message passing simulates "Find" operations (traversing to ancestors) |
|
|
- Can extend to dynamic "Union" operations (merging related tokens) |
|
|
|
|
|
**Multi-Step Propagation:** |
|
|
|
|
|
With K=4 evolution steps, information flow becomes: |
|
|
``` |
|
|
Step 1: Position i accesses parent i-1 |
|
|
Step 2: Position i now has information from i-2 (via i-1) |
|
|
Step 3: Position i now has information from i-3 (propagated through chain) |
|
|
Step 4: Position i now has information from i-4 (fully propagated) |
|
|
|
|
|
Result: Each position has aggregated context from 4 previous positions |
|
|
through efficient O(n) operations |
|
|
``` |
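This schedule is easy to verify with a small boolean-reachability simulation (illustrative, not model code):

```python
import torch

seq_len, K = 8, 4
parent = (torch.arange(seq_len) - 1).clamp(min=0)  # parent[i] = i-1, parent[0] = 0
seen = torch.eye(seq_len, dtype=torch.bool)        # seen[i, j]: position i has info from j
for _ in range(K):
    seen = seen | seen[parent]                     # each step pulls in the parent's context
print(seen[5].nonzero().flatten().tolist())        # [1, 2, 3, 4, 5] -> window [i-K, i]
```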
|
|
|
|
|
This multi-step propagation is crucial for: |
|
|
- **Local context**: Recent tokens for coherence |
|
|
- **Gradient flow**: Direct paths for backpropagation |
|
|
- **Efficiency**: Linear cost instead of quadratic attention |
|
|
|
|
|
**Fusion mechanism:** |
|
|
``` |
|
|
aspp_out = ASPP(hidden_states) # Union-Find graph propagation (O(n)) |
|
|
attn_out = Attention(hidden_states, mask, ...) # Global attention (O(n²)) |
|
|
gate = sigmoid(linear([aspp_out || attn_out])) |
|
|
fused = gate * aspp_out + (1 - gate) * attn_out |
|
|
|
|
|
# Combines: |
|
|
# - Local structured reasoning (ASPP via Union-Find) |
|
|
# - Global contextual awareness (Attention) |
|
|
``` |
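A shape-level sketch of the fusion in PyTorch (the per-feature gate granularity is an assumption):

```python
import torch
import torch.nn as nn

D = 576                               # model hidden size
gate_proj = nn.Linear(2 * D, D)       # illustrative gate projection

aspp_out = torch.randn(2, 16, D)      # [batch, seq, hidden]; stand-ins for the two branches
attn_out = torch.randn(2, 16, D)
gate = torch.sigmoid(gate_proj(torch.cat([aspp_out, attn_out], dim=-1)))
fused = gate * aspp_out + (1 - gate) * attn_out
```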
|
|
|
|
|
### 2. π-Flow Refinement (Per-Layer) |
|
|
|
|
|
```python |
|
|
# Added to each hybrid layer |
|
|
self.pi_flow_aspp = ASPPOperator(...) # Velocity field network |
|
|
self.pi_flow_scale = Parameter(0.2) # Learnable flow strength |
|
|
self.pi_flow_gate = MLP(hidden_size -> 1) # Token-wise adaptive gating |
|
|
``` |
|
|
|
|
|
**π-Flow forward pass:** |
|
|
``` |
|
|
function π_flow_refinement(hidden_states): |
|
|
for step = 1 to π_flow_steps: |
|
|
# Compute velocity field using dedicated ASPP |
|
|
v = pi_flow_aspp(hidden_states) |
|
|
|
|
|
# Adaptive per-token gating |
|
|
gate = sigmoid(pi_flow_gate(hidden_states)) # [B, L, 1] |
|
|
alpha = pi_flow_scale * gate |
|
|
|
|
|
# Euler step in probability space |
|
|
hidden_states = hidden_states + alpha * v |
|
|
|
|
|
return hidden_states |
|
|
``` |
|
|
|
|
|
**Key design choices** (a runnable sketch follows the list):
|
|
1. **Per-layer π-flow**: Each of 30 layers has independent π-flow parameters |
|
|
2. **Learnable scale**: `pi_flow_scale` adapts flow strength during training |
|
|
3. **Token-wise gating**: Different tokens get different flow magnitudes |
|
|
4. **ASPP velocity**: Reuses ASPP architecture for computing v(h) |
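Putting these choices together, a minimal runnable sketch of the refinement loop, with an MLP standing in for the ASPP velocity field (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class PiFlowRefinement(nn.Module):
    """Minimal sketch of per-layer π-flow refinement (not the exact implementation)."""

    def __init__(self, hidden_size=576, steps=2, init_scale=0.2):
        super().__init__()
        self.velocity = nn.Sequential(                        # stand-in for pi_flow_aspp
            nn.Linear(hidden_size, hidden_size), nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
        self.scale = nn.Parameter(torch.tensor(init_scale))   # learnable flow strength
        self.gate = nn.Linear(hidden_size, 1)                 # token-wise adaptive gating
        self.steps = steps

    def forward(self, h):
        for _ in range(self.steps):
            v = self.velocity(h)                              # velocity field v(h)
            alpha = self.scale * torch.sigmoid(self.gate(h))  # [B, L, 1]
            h = h + alpha * v                                 # Euler step: h' = h + α·v(h)
        return h

refined = PiFlowRefinement()(torch.randn(2, 16, 576))
```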
|
|
|
|
|
### 3. Complete Layer Pseudocode |
|
|
|
|
|
``` |
|
|
function HybridLayerWithPiFlow(hidden_states, attention_mask, ...): |
|
|
residual = hidden_states |
|
|
hidden_states = input_layernorm(hidden_states) |
|
|
|
|
|
# === Hybrid ASPP-Attention (Base Asterisk) === |
|
|
aspp_output = aspp_operator(hidden_states) |
|
|
attn_output = self_attention(hidden_states, attention_mask, ...) |
|
|
|
|
|
# Gated fusion |
|
|
fusion_input = concat([aspp_output, attn_output]) |
|
|
gate = sigmoid(linear(dropout(fusion_input))) |
|
|
fused_output = gate * aspp_output + (1 - gate) * attn_output |
|
|
|
|
|
# Residual connection |
|
|
hidden_states = residual + fused_output |
|
|
|
|
|
# === π-Flow Refinement (NEW) === |
|
|
for step in [1..pi_flow_steps]: |
|
|
v = pi_flow_aspp(hidden_states) |
|
|
alpha = pi_flow_scale * sigmoid(pi_flow_gate(hidden_states)) |
|
|
hidden_states = hidden_states + alpha * v |
|
|
|
|
|
# === MLP Block === |
|
|
residual = hidden_states |
|
|
hidden_states = post_attention_layernorm(hidden_states) |
|
|
hidden_states = mlp(hidden_states) |
|
|
hidden_states = residual + hidden_states |
|
|
|
|
|
return hidden_states |
|
|
``` |
|
|
|
|
|
## Parameter Breakdown |
|
|
|
|
|
| Component | Parameters | Notes | |
|
|
|-----------|------------|-------| |
|
|
| **Base SmolLM2** | 135.6M | Embeddings, attention, MLP | |
|
|
| **ASPP Operators** | 35.5M | 30 layers × ~1.2M each | |
|
|
| **π-Flow ASPPs** | 2.3M | 30 layers × ~77k each | |
|
|
| **π-Flow Gates** | 0.2M | 30 layers × ~7k each | |
|
|
| **π-Flow Scales** | 30 | 30 learnable scalars | |
|
|
| **Total** | **173.7M** | +28% vs base SmolLM2 | |
|
|
|
|
|
π-Flow adds only **1.4% more parameters** (2.5M) compared to base Asterisk (171.2M) while providing 60 total refinement steps. |
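These counts can be spot-checked on a loaded checkpoint; the `"aspp"`/`"pi_flow"` substrings below are assumed parameter-name filters, not confirmed identifiers:

```python
# Assumes `model` was loaded as in the Quick Start section below.
total = sum(p.numel() for p in model.parameters())
pi_flow = sum(p.numel() for n, p in model.named_parameters() if "pi_flow" in n)
aspp = sum(p.numel() for n, p in model.named_parameters()
           if "aspp" in n and "pi_flow" not in n)
print(f"total={total/1e6:.1f}M  aspp={aspp/1e6:.1f}M  pi-flow={pi_flow/1e6:.1f}M")
```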
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
import torch |
|
|
|
|
|
# Load model and tokenizer |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
"NoesisLab/Asterisk-Pi", |
|
|
trust_remote_code=True, |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="auto" |
|
|
) |
|
|
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Asterisk-Pi") |
|
|
|
|
|
# Generate text |
|
|
messages = [{"role": "user", "content": "Explain the waterfall model in software engineering."}] |
|
|
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
|
|
|
|
|
outputs = model.generate( |
|
|
inputs, |
|
|
max_new_tokens=256, |
|
|
temperature=0.7, |
|
|
do_sample=True, |
|
|
) |
|
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Dataset |
|
|
|
|
|
A mixed dataset spanning math, commonsense, science QA, and instruction-following data, chosen to probe a broad range of capabilities:
|
|
|
|
|
| Dataset | Ratio | Purpose | |
|
|
|---------|-------|---------| |
|
|
| **GSM8K** | 25% | Math reasoning benchmark | |
|
|
| **HellaSwag** | 30% | Commonsense reasoning benchmark | |
|
|
| **ARC** | 20% | Science QA (Easy + Challenge) | |
|
|
| **OpenHermes** | 10% | High-quality long-form responses | |
|
|
| **Capybara** | 15% | Multi-turn conversations | |
|
|
|
|
|
Total: ~10,148 training samples |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
- **Starting Point**: Asterisk checkpoint (base ASPP-Attention model) |
|
|
- **Optimizer**: AdamW (lr=5e-4, weight_decay=0.1) |
|
|
- **Batch Size**: 2 per device, gradient accumulation=4 (effective batch=8) |
|
|
- **Epochs**: 2 |
|
|
- **Scheduler**: Linear warmup (10% of steps) |
|
|
- **Mixed Precision**: bfloat16 |
|
|
- **Gradient Checkpointing**: Enabled |
|
|
- **Max Grad Norm**: 1.0 |
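As a rough sketch, the configuration above maps onto a TRL `SFTConfig` as follows; `model` and `train_dataset` are assumed to be defined elsewhere:

```python
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    learning_rate=5e-4,
    weight_decay=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size 8
    num_train_epochs=2,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                # linear warmup over 10% of steps
    bf16=True,
    gradient_checkpointing=True,
    max_grad_norm=1.0,
)
trainer = SFTTrainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```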
|
|
|
|
|
### π-Flow Configuration |
|
|
|
|
|
```python |
|
|
pi_flow = True |
|
|
pi_flow_steps = 2 # 2 refinement steps per layer |
|
|
pi_flow_scale = 1.0 # Initial flow strength |
|
|
pi_flow_use_gate = True # Token-wise adaptive gating |
|
|
``` |
|
|
|
|
|
### ASPP Configuration (Inherited from Base) |
|
|
|
|
|
```python |
|
|
aspp_hidden_dim = 256 # Internal dimension (vs 576 model hidden_size) |
|
|
aspp_num_steps = 4 # Evolution steps for ASPP |
|
|
aspp_dropout = 0.2 # Regularization |
|
|
hybrid_layer_indices = None # All 30 layers |
|
|
``` |
|
|
|
|
|
## Model Creation from Base Asterisk |
|
|
|
|
|
```python |
|
|
from AsteriskForCausalLM import AsteriskConfig, AsteriskForCausalLM
from safetensors.torch import load_file
import torch

# Load the Asterisk config and inject π-flow parameters
config = AsteriskConfig.from_pretrained("path/to/Asterisk", trust_remote_code=True)
|
|
|
|
|
# Add π-flow configuration |
|
|
config.pi_flow = True |
|
|
config.pi_flow_steps = 2 |
|
|
config.pi_flow_scale = 1.0 |
|
|
config.pi_flow_use_gate = True |
|
|
|
|
|
# Create model with π-flow |
|
|
model = AsteriskForCausalLM(config) |
|
|
|
|
|
# Load pretrained Asterisk weights (strict=False ignores new π-flow params) |
|
|
state_dict = load_file("path/to/Asterisk/model.safetensors") |
|
|
missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False) |
|
|
|
|
|
# π-flow parameters are randomly initialized |
|
|
print(f"New π-flow parameters: {len(missing_keys)}") |
|
|
|
|
|
# Move to device |
|
|
model = model.to(dtype=torch.bfloat16, device="cuda") |
|
|
``` |
|
|
|
|
|
## Theoretical Background |
|
|
|
|
|
### π-Flow: Probability Flow ODE |
|
|
|
|
|
Inspired by diffusion model score-based formulations: |
|
|
|
|
|
``` |
|
|
dx/dt = v(x, t) [Continuous probability flow] |
|
|
``` |
|
|
|
|
|
Discretized with Euler method: |
|
|
``` |
|
|
x_{t+1} = x_t + Δt * v(x_t) |
|
|
``` |
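As a sanity check on the discretization, a scalar example with v(x) = -x, whose exact solution is x(t) = e^{-t}:

```python
# Euler integration of dx/dt = -x from x(0) = 1 to t = 1
x, dt = 1.0, 0.1
for _ in range(10):
    x = x + dt * (-x)   # x_{t+1} = x_t + Δt · v(x_t)
print(x)                # 0.3487 (exact: e^{-1} ≈ 0.3679)
```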
|
|
|
|
|
In Asterisk-Pi: |
|
|
- `x_t` = hidden states at layer output |
|
|
- `v(x_t)` = velocity field from dedicated ASPP |
|
|
- `Δt` = learnable `pi_flow_scale * gate(x_t)` |
|
|
|
|
|
### Multi-Scale Refinement |
|
|
|
|
|
- **Layer-level**: 30 hybrid layers with ASPP-Attention fusion |
|
|
- **π-Flow level**: 2 steps per layer = 60 total refinement operations |
|
|
- **ASPP-level**: 4 evolution steps within each ASPP = 240 micro-updates |
|
|
|
|
|
This creates a **hierarchical refinement cascade** enabling gradual convergence to high-quality representations. |
|
|
|
|
|
### Why π-Flow Helps |
|
|
|
|
|
1. **Iterative refinement**: Multiple passes allow correcting errors |
|
|
2. **Adaptive flow**: Token-wise gating focuses computation where needed |
|
|
3. **Gradient flow**: More direct paths for gradient propagation |
|
|
4. **Expressiveness**: Increases model capacity with minimal parameters |
|
|
|
|
|
## Implementation Details |
|
|
|
|
|
### Return Type Handling |
|
|
|
|
|
Critical for Transformers compatibility: |
|
|
|
|
|
```python |
|
|
# HybridASPPAttentionLayer.forward() returns tensor only |
|
|
def forward(self, hidden_states, ...) -> torch.Tensor: |
|
|
# ... ASPP + Attention + π-flow ... |
|
|
return hidden_states # ✅ Tensor, not tuple |
|
|
|
|
|
# This matches LlamaDecoderLayer API: -> torch.Tensor |
|
|
``` |
|
|
|
|
|
### Gradient Checkpointing Compatibility |
|
|
|
|
|
π-Flow is fully compatible with gradient checkpointing: |
|
|
- All operations are standard PyTorch ops |
|
|
- No custom CUDA kernels |
|
|
- Automatic differentiation through flow steps |
|
|
|
|
|
### Weight Initialization |
|
|
|
|
|
- **ASPP parameters**: Transferred from base Asterisk |
|
|
- **π-Flow ASPP**: Randomly initialized (Xavier uniform) |
|
|
- **π-Flow scale**: Initialized to 0.2 (conservative) |
|
|
- **π-Flow gate**: Initialized to output ~0.5 (balanced) |
|
|
|
|
|
## Files in Checkpoint |
|
|
|
|
|
``` |
|
|
Asterisk-Pi/ |
|
|
├── AsteriskForCausalLM.py # Model implementation (with π-flow) |
|
|
├── config.json # Model configuration |
|
|
├── model.safetensors # Model weights |
|
|
├── tokenizer.json # Tokenizer |
|
|
├── generation_config.json # Generation settings |
|
|
└── README.md # This file |
|
|
``` |
|
|
|
|
|
## Differences from Base Asterisk |
|
|
|
|
|
| Feature | Asterisk | Asterisk-Pi | |
|
|
|---------|----------|-------------| |
|
|
| **ASPP-Attention** | ✅ | ✅ | |
|
|
| **π-Flow Refinement** | ❌ | ✅ (per-layer) | |
|
|
| **Parameters** | 171.2M | 173.7M (+1.4%) | |
|
|
| **Refinement Steps** | 30 (layers) | 60 (30 layers × 2) | |
|
|
| **Training Dataset** | Capybara | Mixed Benchmarks | |
|
|
| **Complexity** | Medium | High | |
|
|
|
|
|
## Known Issues & Solutions |
|
|
|
|
|
### 1. Return Type Errors |
|
|
|
|
|
**Issue**: `AttributeError: 'tuple' object has no attribute 'dtype'` |
|
|
|
|
|
**Solution**: `HybridASPPAttentionLayer.forward()` must return `torch.Tensor` only, not tuple. This matches the `LlamaDecoderLayer` API in transformers 4.57.6. |
|
|
|
|
|
### 2. π-Flow in All Layers vs Final Layer |
|
|
|
|
|
**Initial approach**: π-flow only in final layer (limited expressiveness) |
|
|
|
|
|
**Current approach**: π-flow in all 30 hybrid layers for maximum refinement capability. |
|
|
|
|
|
### 3. Training Stability |
|
|
|
|
|
π-Flow can destabilize training at aggressive learning rates. Mitigations used here:
- Learning rate tuned for this setup (5e-4 in this run, vs 2e-5 for the base model)
- Gradient clipping (max_norm=1.0)
- Conservative initial flow scale (0.2-1.0)
|
|
|
|
|
## Dependencies |
|
|
|
|
|
```bash |
|
|
pip install "torch>=2.0.0"
pip install "transformers>=4.40.0"
pip install "trl>=0.8.0"
pip install "datasets>=2.14.0"
pip install "accelerate>=0.25.0"
pip install bitsandbytes
pip install safetensors
|
|
``` |
|
|
|
|
|
## Citations |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{asteriskpi2026, |
|
|
title={Asterisk-Pi: Probability Flow Refinement for Hybrid ASPP-Attention Models}, |
|
|
author={NoesisLab}, |
|
|
year={2026}, |
|
|
publisher={Hugging Face},
|
|
url={https://huggingface.co/NoesisLab/Asterisk-Pi} |
|
|
} |
|
|
``` |
|
|
|
|
|
```bibtex |
|
|
@misc{asterisk2026, |
|
|
title={Asterisk: Hybrid ASPP-Attention Architecture for Enhanced Language Modeling}, |
|
|
author={NoesisLab}, |
|
|
year={2026}, |
|
|
publisher={Hugging Face},
|
|
url={https://huggingface.co/NoesisLab/Asterisk} |
|
|
} |
|
|
``` |
|
|
|
|
|
```bibtex |
|
|
@misc{vonwerra2022trl, |
|
|
title={{TRL: Transformer Reinforcement Learning}}, |
|
|
author={Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec}, |
|
|
year={2020}, |
|
|
journal={GitHub repository}, |
|
|
publisher={GitHub}, |
|
|
howpublished={\url{https://github.com/huggingface/trl}} |
|
|
} |
|
|
``` |
|
|
|
|
|
```bibtex |
|
|
@article{allal2024SmolLM2, |
|
|
title={SmolLM2 - with great data, comes great performance}, |
|
|
author={Allal, Loubna Ben and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro}, |
|
|
year={2024} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Related Work |
|
|
|
|
|
- **Diffusion Models**: π-flow inspired by probability flow ODEs in score-based diffusion |
|
|
- **Neural ODEs**: Continuous-depth models with adaptive computation |
|
|
- **Iterative Refinement**: Multi-pass decoding in sequence models |
|
|
|
|
|
## Future Directions |
|
|
|
|
|
1. **Adaptive π-flow steps**: Learn number of refinement steps per layer |
|
|
2. **Higher-order ODE solvers**: Replace Euler with RK4 or adaptive schemes |
|
|
3. **Stochastic π-flow**: Add noise injection for exploration |
|
|
4. **Cross-layer π-flow**: Allow information flow between distant layers |
|
|
|
|
|
## License |
|
|
|
|
|
This model inherits the Apache 2.0 license from SmolLM2-135M-Instruct. |
|
|
|
|
|
## Framework Versions |
|
|
|
|
|
- **TRL**: 0.27.0 |
|
|
- **Transformers**: 4.57.6 |
|
|
- **PyTorch**: 2.8.0+cu128 |
|
|
- **Datasets**: 4.5.0 |
|
|
- **Tokenizers**: 0.22.2 |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
Built on top of: |
|
|
- [Asterisk](https://huggingface.co/NoesisLab/Asterisk) - Base ASPP-Attention architecture |
|
|
- [SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct) - Foundation model |
|
|
- [TRL](https://github.com/huggingface/trl) - Training framework |
|
|
|
|
|
Special thanks to the diffusion model community for probability flow ODE insights. |
|
|
|