Asterisk-Pi: ASPP-Attention with π-Flow Refinement
Asterisk-Pi is an enhanced version of the Asterisk model that adds π-flow (probability flow) refinement to the hybrid ASPP-Attention architecture. Building on the SmolLM2-135M base, Asterisk-Pi implements per-layer iterative refinement inspired by probability flow ODEs from diffusion models, enabling multi-step reasoning through continuous state evolution.
Model Description
- Base Model: Asterisk (SmolLM2-135M-Instruct with ASPP)
- Architecture: Hybrid ASPP-Attention + Per-Layer π-Flow (30 hybrid layers)
- Parameters: 173.7M (35.5M ASPP + 2.5M π-flow parameters on top of the 135.6M base)
- Training: Supervised Fine-Tuning on Mixed Benchmark Dataset
- Framework: Transformers 4.57.6, TRL 0.27.0
Key Innovation: π-Flow Refinement
π-Flow (Probability Flow) adds iterative refinement to each hybrid layer, inspired by continuous-time probability flow ODEs:
h' = h + α * v(h) [Euler discretization]
Where:
- v(h) is the velocity field computed by a dedicated ASPP operator
- α is a learnable per-token scaling factor (adaptive gating)
- The refinement is applied after the ASPP-Attention fusion in each layer
This enables 60 total refinement steps (30 layers × 2 steps each) throughout the model, allowing gradual convergence to more refined representations.
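For concreteness, here is a minimal PyTorch sketch of this per-layer update. The class name PiFlowRefinement and the small MLP standing in for the dedicated ASPP velocity network are illustrative assumptions, not the released implementation:

import torch
import torch.nn as nn

class PiFlowRefinement(nn.Module):
    def __init__(self, hidden_size, num_steps=2, init_scale=0.2):
        super().__init__()
        self.num_steps = num_steps
        # Stand-in for the dedicated ASPP operator that computes the velocity field v(h)
        self.velocity = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.SiLU(),
            nn.Linear(hidden_size, hidden_size),
        )
        self.scale = nn.Parameter(torch.tensor(init_scale))  # learnable flow strength
        self.gate = nn.Linear(hidden_size, 1)                 # token-wise adaptive gate

    def forward(self, h):                                     # h: [batch, seq_len, hidden]
        for _ in range(self.num_steps):
            v = self.velocity(h)                              # velocity field v(h)
            alpha = self.scale * torch.sigmoid(self.gate(h))  # per-token step size α
            h = h + alpha * v                                 # Euler step: h' = h + α * v(h)
        return h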
Evaluation Results
Evaluated on LM-Evaluation-Harness:
| Task | Metric | Asterisk-Pi (173.7M) | Asterisk (171.2M) | SmolLM2-135M (135.6M) | Gemma-3-270m-it (270M) | Δ vs Asterisk | Δ vs SmolLM2 | Δ vs Gemma-3 |
|---|---|---|---|---|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3038 | 0.2884 | 0.2773 | 0.2730 | +0.0154 | +0.0265 | +0.0308 |
| ARC-Easy | acc_norm | 0.5412 | 0.5450 | 0.4899 | 0.5059 | -0.0038 | +0.0513 | +0.0353 |
| HellaSwag | acc_norm | 0.4207 | 0.4430 | 0.4293 | 0.3937 | -0.0223 | -0.0086 | +0.0270 |
| PIQA | acc_norm | 0.6703 | 0.6770 | 0.6632 | 0.6692 | -0.0067 | +0.0071 | +0.0011 |
| WinoGrande | acc | 0.5391 | 0.5210 | 0.5154 | 0.5257 | +0.0181 | +0.0237 | +0.0134 |
Analysis
π-Flow improvements over base Asterisk:
- ARC-Challenge (+1.54%): More challenging reasoning benefits from iterative refinement
- WinoGrande (+1.81%): Multi-step resolution helps with pronoun disambiguation
Improvements over SmolLM2-135M base:
- ARC-Challenge (+2.65%): Hybrid architecture + π-flow significantly improves complex reasoning
- ARC-Easy (+5.13%): Strong gains on elementary science questions
- WinoGrande (+2.37%): Better pronoun disambiguation through iterative refinement
- PIQA (+0.71%): Modest gains on physical commonsense
Outperforming Gemma-3-270m-it (with 96M fewer parameters):
- ARC-Challenge (+3.08%): Superior reasoning despite being 35% smaller
- ARC-Easy (+3.53%): Significant advantage on elementary science
- HellaSwag (+2.70%): Much stronger commonsense reasoning
- WinoGrande (+1.34%): Better coreference resolution
- PIQA (+0.11%): Comparable physical reasoning
Key insight: Asterisk-Pi (173.7M params) consistently outperforms the much larger Gemma-3-270m-it (270M params), demonstrating that the hybrid ASPP-Attention architecture with π-flow refinement achieves superior parameter efficiency. The structured reasoning approach enables better performance per parameter, especially on complex multi-step reasoning tasks.
Architecture
Overview
Figure: Asterisk-Pi architecture showing the hybrid ASPP-Attention structure with π-flow refinement. Each of the 30 layers contains parallel ASPP and Attention branches, gated fusion, and iterative π-flow refinement using probability flow ODE.
Input → [30 Hybrid Layers with π-Flow] → Output
Each Hybrid Layer:
1. ASPP-Attention Fusion (from base Asterisk)
2. π-Flow Refinement (NEW)
3. Feed-Forward Network
1. Hybrid ASPP-Attention Layer (Base Asterisk)
class HybridASPPAttentionLayer:
    """
    Combines ASPP operator with standard attention

    Components:
    - ASPP operator: Local structured reasoning
    - Standard attention: Global context
    - Gated fusion: Dynamic balancing
    """
Fusion mechanism:
aspp_out = ASPP(hidden_states)
attn_out = Attention(hidden_states, mask, ...)
gate = sigmoid(linear([aspp_out || attn_out]))
fused = gate * aspp_out + (1 - gate) * attn_out
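A self-contained sketch of this gated fusion is shown below. The class name, the per-feature gate dimensionality, and the dropout placement (taken from the full-layer pseudocode later in this card) are assumptions, not the released implementation:

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, hidden_size, dropout=0.2):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.gate_proj = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, aspp_out, attn_out):                       # both [batch, seq_len, hidden]
        fusion_input = torch.cat([aspp_out, attn_out], dim=-1)   # [aspp_out || attn_out]
        gate = torch.sigmoid(self.gate_proj(self.dropout(fusion_input)))
        return gate * aspp_out + (1 - gate) * attn_out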
2. π-Flow Refinement (Per-Layer)
# Added to each hybrid layer
self.pi_flow_aspp = ASPPOperator(...) # Velocity field network
self.pi_flow_scale = Parameter(0.2) # Learnable flow strength
self.pi_flow_gate = MLP(hidden_size -> 1) # Token-wise adaptive gating
π-Flow forward pass:
def pi_flow_refinement(hidden_states):
    for step in range(pi_flow_steps):
        # Compute velocity field using the dedicated ASPP
        v = pi_flow_aspp(hidden_states)
        # Adaptive per-token gating
        gate = sigmoid(pi_flow_gate(hidden_states))  # [B, L, 1]
        alpha = pi_flow_scale * gate
        # Euler step in probability space
        hidden_states = hidden_states + alpha * v
    return hidden_states
Key design choices:
- Per-layer π-flow: Each of 30 layers has independent π-flow parameters
- Learnable scale: pi_flow_scale adapts flow strength during training
- Token-wise gating: Different tokens get different flow magnitudes
- ASPP velocity: Reuses ASPP architecture for computing v(h)
3. Complete Layer Pseudocode
def HybridLayerWithPiFlow(hidden_states, attention_mask, ...):
    residual = hidden_states
    hidden_states = input_layernorm(hidden_states)

    # === Hybrid ASPP-Attention (Base Asterisk) ===
    aspp_output = aspp_operator(hidden_states)
    attn_output = self_attention(hidden_states, attention_mask, ...)

    # Gated fusion
    fusion_input = concat([aspp_output, attn_output])
    gate = sigmoid(linear(dropout(fusion_input)))
    fused_output = gate * aspp_output + (1 - gate) * attn_output

    # Residual connection
    hidden_states = residual + fused_output

    # === π-Flow Refinement (NEW) ===
    for step in range(pi_flow_steps):
        v = pi_flow_aspp(hidden_states)
        alpha = pi_flow_scale * sigmoid(pi_flow_gate(hidden_states))
        hidden_states = hidden_states + alpha * v

    # === MLP Block ===
    residual = hidden_states
    hidden_states = post_attention_layernorm(hidden_states)
    hidden_states = mlp(hidden_states)
    hidden_states = residual + hidden_states

    return hidden_states
Parameter Breakdown
| Component | Parameters | Notes |
|---|---|---|
| Base SmolLM2 | 135.6M | Embeddings, attention, MLP |
| ASPP Operators | 35.5M | 30 layers × ~1.2M each |
| π-Flow ASPPs | 2.3M | 30 layers × ~77k each |
| π-Flow Gates | 0.2M | 30 layers × ~7k each |
| π-Flow Scales | 30 | 30 learnable scalars |
| Total | 173.7M | +28% vs base SmolLM2 |
π-Flow adds only 1.4% more parameters (2.5M) compared to base Asterisk (171.2M) while providing 60 total refinement steps.
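A quick way to check this breakdown on a loaded checkpoint (loaded, for example, as in the Quick Start below), assuming the π-flow submodules carry "pi_flow" in their parameter names as in the attribute names shown earlier:

total = sum(p.numel() for p in model.parameters())
pi_flow = sum(p.numel() for n, p in model.named_parameters() if "pi_flow" in n)
print(f"total: {total / 1e6:.1f}M | pi-flow: {pi_flow / 1e6:.2f}M ({100 * pi_flow / total:.2f}%)")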
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Asterisk-Pi",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Asterisk-Pi")

# Generate text
messages = [{"role": "user", "content": "Explain the waterfall model in software engineering."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
Training Dataset
Mixed benchmark dataset for testing true capabilities:
| Dataset | Ratio | Purpose |
|---|---|---|
| GSM8K | 25% | Math reasoning benchmark |
| HellaSwag | 30% | Commonsense reasoning benchmark |
| ARC | 20% | Science QA (Easy + Challenge) |
| OpenHermes | 10% | High-quality long-form responses |
| Capybara | 15% | Multi-turn conversations |
Total: ~10,148 training samples
Training Configuration
- Starting Point: Asterisk checkpoint (base ASPP-Attention model)
- Optimizer: AdamW (lr=5e-4, weight_decay=0.1)
- Batch Size: 2 per device, gradient accumulation=4 (effective batch=8)
- Epochs: 2
- Scheduler: Linear warmup (10% of steps)
- Mixed Precision: bfloat16
- Gradient Checkpointing: Enabled
- Max Grad Norm: 1.0
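As a rough sketch, these settings map onto TRL's SFTConfig/SFTTrainer as follows. The output_dir is a placeholder, and model and train_dataset are assumed to be the Asterisk checkpoint with π-flow enabled and the mixed benchmark dataset described above (AdamW is the Trainer default optimizer):

from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir="asterisk-pi-sft",      # placeholder
    learning_rate=5e-4,
    weight_decay=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,     # effective batch size 8
    num_train_epochs=2,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                  # 10% linear warmup
    bf16=True,
    gradient_checkpointing=True,
    max_grad_norm=1.0,
)
trainer = SFTTrainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()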
π-Flow Configuration
pi_flow = True
pi_flow_steps = 2 # 2 refinement steps per layer
pi_flow_scale = 1.0 # Initial flow strength
pi_flow_use_gate = True # Token-wise adaptive gating
ASPP Configuration (Inherited from Base)
aspp_hidden_dim = 256 # Internal dimension (vs 576 model hidden_size)
aspp_num_steps = 4 # Evolution steps for ASPP
aspp_dropout = 0.2 # Regularization
hybrid_layer_indices = None # All 30 layers
Model Creation from Base Asterisk
from AsteriskForCausalLM import AsteriskForCausalLM
from safetensors.torch import load_file
import torch
# Load Asterisk config and inject π-flow parameters
from AsteriskForCausalLM import AsteriskConfig
config = AsteriskConfig.from_pretrained("path/to/Asterisk", trust_remote_code=True)
# Add π-flow configuration
config.pi_flow = True
config.pi_flow_steps = 2
config.pi_flow_scale = 1.0
config.pi_flow_use_gate = True
# Create model with π-flow
model = AsteriskForCausalLM(config)
# Load pretrained Asterisk weights (strict=False ignores new π-flow params)
state_dict = load_file("path/to/Asterisk/model.safetensors")
missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
# π-flow parameters are randomly initialized
print(f"New π-flow parameters: {len(missing_keys)}")
# Move to device
model = model.to(dtype=torch.bfloat16, device="cuda")
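A small sanity check after loading (assuming, as in the attribute names above, that the new parameters all contain "pi_flow" in their names):

# All missing keys should correspond to the freshly added π-flow modules,
# and nothing from the base checkpoint should be left over
assert all("pi_flow" in key for key in missing_keys)
assert not unexpected_keys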
Theoretical Background
π-Flow: Probability Flow ODE
Inspired by diffusion model score-based formulations:
dx/dt = v(x, t) [Continuous probability flow]
Discretized with Euler method:
x_{t+1} = x_t + Δt * v(x_t)
In Asterisk-Pi:
- x_t = hidden states at the layer output
- v(x_t) = velocity field from the dedicated ASPP
- Δt = learnable pi_flow_scale * gate(x_t)
Multi-Scale Refinement
- Layer-level: 30 hybrid layers with ASPP-Attention fusion
- π-Flow level: 2 steps per layer = 60 total refinement operations
- ASPP-level: 4 evolution steps within each ASPP = 240 micro-updates
This creates a hierarchical refinement cascade enabling gradual convergence to high-quality representations.
Why π-Flow Helps
- Iterative refinement: Multiple passes allow correcting errors
- Adaptive flow: Token-wise gating focuses computation where needed
- Gradient flow: More direct paths for gradient propagation
- Expressiveness: Increases model capacity with minimal parameters
Implementation Details
Return Type Handling
Critical for Transformers compatibility:
# HybridASPPAttentionLayer.forward() returns a tensor only
def forward(self, hidden_states, ...) -> torch.Tensor:
    # ... ASPP + Attention + π-flow ...
    return hidden_states  # ✅ Tensor, not tuple

# This matches the LlamaDecoderLayer API: -> torch.Tensor
Gradient Checkpointing Compatibility
π-Flow is fully compatible with gradient checkpointing:
- All operations are standard PyTorch ops
- No custom CUDA kernels
- Automatic differentiation through flow steps
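A minimal way to turn checkpointing on with the standard Hugging Face API (a sketch; the non-reentrant variant is generally the safer choice for layers with internal loops such as π-flow):

model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
model.config.use_cache = False  # the KV cache is incompatible with checkpointed training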
Weight Initialization
- ASPP parameters: Transferred from base Asterisk
- π-Flow ASPP: Randomly initialized (Xavier uniform)
- π-Flow scale: Initialized to 0.2 (conservative)
- π-Flow gate: Initialized to output ~0.5 (balanced)
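A sketch of this initialization scheme, assuming for simplicity that pi_flow_gate is a single linear projection (the card describes it as a small MLP, so adapt accordingly):

import torch
import torch.nn as nn

def init_pi_flow(layer):
    # π-flow velocity network: Xavier-uniform weights, zero biases
    for m in layer.pi_flow_aspp.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    # Conservative initial flow strength
    with torch.no_grad():
        layer.pi_flow_scale.fill_(0.2)
    # Zero weights and bias give sigmoid(0) = 0.5, i.e. a balanced gate at the start
    nn.init.zeros_(layer.pi_flow_gate.weight)
    nn.init.zeros_(layer.pi_flow_gate.bias)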
Files in Checkpoint
Asterisk-Pi/
├── AsteriskForCausalLM.py # Model implementation (with π-flow)
├── config.json # Model configuration
├── model.safetensors # Model weights
├── tokenizer.json # Tokenizer
├── generation_config.json # Generation settings
└── README.md # This file
Differences from Base Asterisk
| Feature | Asterisk | Asterisk-Pi |
|---|---|---|
| ASPP-Attention | ✅ | ✅ |
| π-Flow Refinement | ❌ | ✅ (per-layer) |
| Parameters | 171.2M | 173.7M (+1.4%) |
| Refinement Steps | 30 (layers) | 60 (30 layers × 2) |
| Training Dataset | Capybara | Mixed Benchmarks |
| Complexity | Medium | High |
Known Issues & Solutions
1. Return Type Errors
Issue: AttributeError: 'tuple' object has no attribute 'dtype'
Solution: HybridASPPAttentionLayer.forward() must return torch.Tensor only, not tuple. This matches the LlamaDecoderLayer API in transformers 4.57.6.
2. π-Flow in All Layers vs Final Layer
Initial approach: π-flow only in final layer (limited expressiveness)
Current approach: π-flow in all 30 hybrid layers for maximum refinement capability.
3. Training Stability
π-Flow can cause instability with high learning rates. Use:
- A carefully chosen learning rate (5e-4 here, vs 2e-5 for the base Asterisk fine-tune)
- Gradient clipping (max_norm=1.0)
- Conservative initial flow scale (0.2-1.0)
Dependencies
pip install "torch>=2.0.0"
pip install "transformers>=4.40.0"
pip install "trl>=0.8.0"
pip install "datasets>=2.14.0"
pip install "accelerate>=0.25.0"
pip install bitsandbytes
pip install safetensors
Citations
If you use this model, please cite:
@misc{asteriskpi2026,
title={Asterisk-Pi: Probability Flow Refinement for Hybrid ASPP-Attention Models},
author={NoesisLab},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/NoesisLab/Asterisk-Pi}
}
@misc{asterisk2026,
title={Asterisk: Hybrid ASPP-Attention Architecture for Enhanced Language Modeling},
author={NoesisLab},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/NoesisLab/Asterisk}
}
@misc{vonwerra2022trl,
title={{TRL: Transformer Reinforcement Learning}},
author={Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
year={2020},
journal={GitHub repository},
publisher={GitHub},
howpublished={\url{https://github.com/huggingface/trl}}
}
@article{allal2024SmolLM2,
title={SmolLM2 - with great data, comes great performance},
author={Allal, Loubna Ben and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro},
year={2024}
}
Related Work
- Diffusion Models: π-flow inspired by probability flow ODEs in score-based diffusion
- Neural ODEs: Continuous-depth models with adaptive computation
- Iterative Refinement: Multi-pass decoding in sequence models
Future Directions
- Adaptive π-flow steps: Learn number of refinement steps per layer
- Higher-order ODE solvers: Replace Euler with RK4 or adaptive schemes
- Stochastic π-flow: Add noise injection for exploration
- Cross-layer π-flow: Allow information flow between distant layers
License
This model inherits the Apache 2.0 license from SmolLM2-135M-Instruct.
Framework Versions
- TRL: 0.27.0
- Transformers: 4.57.6
- PyTorch: 2.8.0+cu128
- Datasets: 4.5.0
- Tokenizers: 0.22.2
Acknowledgments
Built on top of:
- Asterisk - Base ASPP-Attention architecture
- SmolLM2-135M-Instruct - Foundation model
- TRL - Training framework
Special thanks to the diffusion model community for probability flow ODE insights.