	---
	library_name: transformers
	model_name: Asterisk-Pi
	base_model: NoesisLab/Asterisk
	tags:
	- aspp
	- pi-flow
	- hybrid-architecture
	- graph-reasoning
	- probability-flow
	- sft
	- trl
	license: apache-2.0
	language:
	- en
	---

	# Asterisk-Pi: ASPP-Attention with π-Flow Refinement

	Asterisk-Pi is an enhanced version of the Asterisk model that adds π-flow (probability flow) refinement to the hybrid ASPP-Attention architecture. Building on the SmolLM2-135M base, Asterisk-Pi implements per-layer iterative refinement inspired by probability flow ODEs from diffusion models, enabling multi-step reasoning through continuous state evolution.

	## Model Description

	- Base Model: [Asterisk](https://huggingface.co/NoesisLab/Asterisk) (SmolLM2-135M-Instruct with ASPP)
	- Architecture: Hybrid ASPP-Attention + Per-Layer π-Flow (30 hybrid layers)
	- Parameters: 173.7M (including 35.5M ASPP and 2.5M π-flow parameters)
	- Training: Supervised Fine-Tuning on Mixed Benchmark Dataset
	- Framework: Transformers 4.57.6, TRL 0.27.0

	## Key Innovation: π-Flow Refinement

	π-Flow (Probability Flow) adds iterative refinement to each hybrid layer, inspired by continuous-time probability flow ODEs:

	```
	h' = h + α * v(h) [Euler discretization]
	```

	Where:
	- `v(h)` is the velocity field computed by a dedicated ASPP operator
	- `α` is a learnable per-token scaling factor (adaptive gating)
	- Applied after ASPP-Attention fusion in each layer

	This enables 60 total refinement steps (30 layers × 2 steps each) throughout the model, allowing gradual convergence to more refined representations.

	## Evaluation Results

	Evaluated on LM-Evaluation-Harness:

	| Task | Metric | Asterisk-Pi<br>(173.7M) | Asterisk<br>(171.2M) | SmolLM2-135M<br>(135.6M) | Gemma-3-270m-it<br>(270M) | Δ vs Asterisk | Δ vs SmolLM2 | Δ vs Gemma-3 |
	|------|--------|-------------|-----------------|--------------|----------------|---------------|--------------|--------------|
	| ARC-Challenge | acc_norm | 0.3038 | 0.2884 | 0.2773 | 0.2730 | +0.0154 | +0.0265 | +0.0308 |
	| ARC-Easy | acc_norm | 0.5412 | 0.5450 | 0.4899 | 0.5059 | -0.0038 | +0.0513 | +0.0353 |
	| HellaSwag | acc_norm | 0.4207 | 0.4430 | 0.4293 | 0.3937 | -0.0223 | -0.0086 | +0.0270 |
	| PIQA | acc_norm | 0.6703 | 0.6770 | 0.6632 | 0.6692 | -0.0067 | +0.0071 | +0.0011 |
	| WinoGrande | acc | 0.5391 | 0.5210 | 0.5154 | 0.5257 | +0.0181 | +0.0237 | +0.0134 |

	### Analysis

	π-Flow improvements over base Asterisk (differences are in percentage points):
	- ARC-Challenge (+1.54 points): Harder reasoning questions appear to benefit from iterative refinement
	- WinoGrande (+1.81 points): Multi-step resolution may help with pronoun disambiguation

	Asterisk-Pi trails base Asterisk on ARC-Easy (-0.38 points), HellaSwag (-2.23 points), and PIQA (-0.67 points).

	Improvements over SmolLM2-135M base:
	- ARC-Challenge (+2.65 points): Hybrid architecture + π-flow improves complex reasoning
	- ARC-Easy (+5.13 points): Strong gains on elementary science questions
	- WinoGrande (+2.37 points): Better pronoun disambiguation
	- PIQA (+0.71 points): Modest gains on physical commonsense

	HellaSwag is the exception, at 0.86 points below SmolLM2-135M.

	Comparison with Gemma-3-270m-it (Asterisk-Pi has about 96M fewer parameters):
	- ARC-Challenge (+3.08 points): Higher score despite being roughly 35% smaller
	- ARC-Easy (+3.53 points): Clear advantage on elementary science
	- HellaSwag (+2.70 points): Stronger commonsense reasoning
	- WinoGrande (+1.34 points): Better coreference resolution
	- PIQA (+0.11 points): Comparable physical reasoning

	Key insight: Asterisk-Pi (173.7M params) scores higher than the larger Gemma-3-270m-it (270M params) on all five tasks, although the PIQA margin is negligible. This suggests that the hybrid ASPP-Attention architecture with π-flow refinement is parameter-efficient on these benchmarks, with the largest margins on the ARC reasoning tasks.
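The deltas quoted in this analysis can be recomputed directly from the evaluation table. The snippet below is a small sanity check rather than part of the evaluation pipeline; the dictionary layout and the `deltas` helper are illustrative names, and the scores are copied from the table.

```python
# Scores copied from the evaluation table (acc_norm, except WinoGrande which is acc).
scores = {
    "ARC-Challenge": {"Asterisk-Pi": 0.3038, "Asterisk": 0.2884, "SmolLM2-135M": 0.2773, "Gemma-3-270m-it": 0.2730},
    "ARC-Easy": {"Asterisk-Pi": 0.5412, "Asterisk": 0.5450, "SmolLM2-135M": 0.4899, "Gemma-3-270m-it": 0.5059},
    "HellaSwag": {"Asterisk-Pi": 0.4207, "Asterisk": 0.4430, "SmolLM2-135M": 0.4293, "Gemma-3-270m-it": 0.3937},
    "PIQA": {"Asterisk-Pi": 0.6703, "Asterisk": 0.6770, "SmolLM2-135M": 0.6632, "Gemma-3-270m-it": 0.6692},
    "WinoGrande": {"Asterisk-Pi": 0.5391, "Asterisk": 0.5210, "SmolLM2-135M": 0.5154, "Gemma-3-270m-it": 0.5257},
}


def deltas(task_scores, model="Asterisk-Pi"):
    """Return the model's score minus each baseline score, rounded to 4 decimals."""
    ours = task_scores[model]
    return {name: round(ours - s, 4) for name, s in task_scores.items() if name != model}


for task, row in scores.items():
    print(task, deltas(row))

# Tasks where Asterisk-Pi beats each baseline.
wins_vs_asterisk = [t for t, r in scores.items() if r["Asterisk-Pi"] > r["Asterisk"]]
wins_vs_gemma = [t for t, r in scores.items() if r["Asterisk-Pi"] > r["Gemma-3-270m-it"]]
print(wins_vs_asterisk)    # ['ARC-Challenge', 'WinoGrande']
print(len(wins_vs_gemma))  # 5
```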

	## Architecture

	### Overview

	![Asterisk-Pi Architecture](./Arch.png)

	Figure: Asterisk-Pi architecture showing the hybrid ASPP-Attention structure with π-flow refinement. Each of the 30 layers contains parallel ASPP and Attention branches, gated fusion, and iterative π-flow refinement using probability flow ODE.

	```
	Input → [30 Hybrid Layers with π-Flow] → Output

	Each Hybrid Layer:
	1. ASPP-Attention Fusion (from base Asterisk)
	2. π-Flow Refinement (NEW)
	3. Feed-Forward Network
	```

	### 1. Hybrid ASPP-Attention Layer (Base Asterisk)

	```python
	class HybridASPPAttentionLayer:
	    """
	    Combines ASPP operator with standard attention

	    Components:
	    - ASPP operator: Local structured reasoning with Union-Find graph propagation
	    - Standard attention: Global context
	    - Gated fusion: Dynamic balancing
	    """
	```

	#### ASPP Operator: Union-Find Graph Propagation

	The ASPP operator uses a Union-Find (Disjoint Set Union) structure for efficient graph-based message passing. Unlike traditional attention's O(n²) complexity or skip-list's O(n log n), Union-Find achieves O(n) complexity with nearly constant-time operations.

	Graph Structure - Union-Find Parent Chain:

	```
	Position: [0] [1] [2] [3] [4] [5] ... [n-1]
	Parent: [0] ← 0 ← 1 ← 2 ← 3 ← 4 ... ← n-2
	(root)

	- Position 0: points to itself (root of the tree)
	- Position i (i>0): points to position i-1 (parent)
	- Forms a linear chain structure for sequential token relationships
	```

	Apart from the root's self-loop, the parent pointers form a directed chain. During message passing each position reads from its parent, so information flows from earlier positions to later ones, which matches the left-to-right sequential dependencies of language modeling.
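The static parent chain is simple enough to build in a couple of lines. The sketch below uses plain Python lists instead of tensors, and the helper names `build_parent_chain` and `gather_parents` are illustrative rather than taken from the model code.

```python
def build_parent_chain(seq_len):
    """Static chain: position 0 is its own parent (root), position i points to i - 1."""
    return [max(i - 1, 0) for i in range(seq_len)]


def gather_parents(states, parents):
    """Pick each position's parent state, the list analogue of h[parent_indices]."""
    return [states[p] for p in parents]


parents = build_parent_chain(6)
print(parents)  # [0, 0, 1, 2, 3, 4]

tokens = ["a", "b", "c", "d", "e", "f"]
print(gather_parents(tokens, parents))  # ['a', 'a', 'b', 'c', 'd', 'e']
```

Because every position has exactly one parent, a gather over `parents` touches each position once.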

	Graph Propagation Aggregation:

	Each ASPP evolution step performs parent-based message passing:

	```python
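	import torch

	# Sketch of one ASPP propagation step, vectorized over all positions
	def aspp_propagation_step(hidden_states, message_net, layer_norm, residual_scale):
	    """hidden_states has shape [B, L, D]; returns the updated states."""
	    seq_len = hidden_states.size(1)

	    # 1. Static parent chain: parent[0] = 0 (root), parent[i] = i - 1 (O(1) lookup)
	    positions = torch.arange(seq_len, device=hidden_states.device)
	    parent_idx = torch.clamp(positions - 1, min=0)

	    # 2. Gather parent features with a single indexing operation
	    parent_features = hidden_states[:, parent_idx]

	    # 3. Message aggregation: combine self + parent -> [B, L, 2D]
	    message_input = torch.cat([hidden_states, parent_features], dim=-1)

	    # 4. Update via learned transformation
	    new_state = message_net(message_input)  # 2-layer MLP, 2D -> D

	    # 5. Scaled residual connection followed by layer norm
	    return layer_norm(hidden_states + residual_scale * new_state)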
	```

	Key properties of Union-Find propagation:

	1. O(n) Complexity: Each position performs exactly one parent lookup and one aggregation per step
	   - No expensive attention computation (O(n²))
	   - No multi-level skip connections (O(n log n))
	   - Simple indexing operation: `parent_features = h[parent_indices]`

	2. Hierarchical Information Flow: After K steps, position i can access information from positions [i-K, i]
	   - K=1: immediate parent only
	   - K=2: grandparent (2 positions back)
	   - K=4 (default): great-great-grandparent (4 positions back)
	   - Information propagates through the chain structure

	3. Learnable Aggregation: The `message_net` MLP learns how to combine self and parent features
	   - Input: `[self_features || parent_features]` (dimension 2·D, where D is the feature dimension)
	   - Output: `D` dimensional update vector
	   - Dropout regularization for robustness

	4. Path Compression Potential: Can extend to dynamic parent reassignment
	   - Current implementation: static `parent[i] = i-1` chain
	   - Future extension: learn parent assignments based on semantic similarity
	   - Enables adaptive graph structure during forward pass

	Union-Find vs. Other Graph Structures:

	| Structure | Complexity | Receptive Field | Connections per Node |
	|-----------|------------|-----------------|----------------------|
	| Full Attention | O(n²) | Global | n-1 (all positions) |
	| Skip-List | O(n log n) | Multi-scale | O(log n) (multiple levels) |
	| Union-Find | O(n) | Local chain | 1 (parent only) |
	| Dilated Conv | O(n·k) | Sparse | k (fixed window) |

	Union-Find achieves the lowest complexity while maintaining effective information propagation through iterative K-step evolution.
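To give a feel for the gap between these complexity classes, the sketch below counts pairwise interactions per propagation step for a given sequence length. The function name and the window size `k=4` are illustrative assumptions, and the counts ignore constant factors such as the MLP cost per message.

```python
import math


def interaction_counts(n, k=4):
    """Rough number of node-to-node interactions per step for each structure."""
    return {
        "full_attention": n * n,                            # every pair of positions
        "skip_list": n * max(1, math.ceil(math.log2(n))),   # ~log n links per node
        "union_find_chain": n,                              # one parent per node
        "dilated_conv": n * k,                              # fixed window of k taps
    }


counts = interaction_counts(1024)
for name, value in counts.items():
    print(f"{name:>18}: {value}")
# full_attention: 1048576, skip_list: 10240, union_find_chain: 1024, dilated_conv: 4096
```

With K evolution steps the chain cost becomes K·n, which is still linear in sequence length.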

	Theoretical Foundation - Union-Find in Graph Algorithms:

	Union-Find is a classic data structure for disjoint set operations:
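	- Find: Determine which set an element belongs to (with path compression and union by rank: amortized O(α(n)) ≈ O(1), where α is the inverse Ackermann function, unrelated to the π-flow scaling factor α)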
	- Union: Merge two sets into one
	- Applications: Kruskal's MST algorithm, connected components, cycle detection

	In Asterisk-Pi:
	- Each token position is a node in the graph
	- Parent pointers define the tree structure
	- Message passing simulates "Find" operations (traversing to ancestors)
	- Can extend to dynamic "Union" operations (merging related tokens)
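For reference, here is the classic data structure behind the Find and Union operations described above, with path compression and union by rank. This is the textbook algorithm, not code from the Asterisk-Pi implementation, which currently uses only the static parent chain.

```python
class UnionFind:
    """Textbook disjoint-set structure with path compression and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))  # every element starts as its own root
        self.rank = [0] * n

    def find(self, x):
        # First pass: walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same set
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the shallower tree under the deeper one
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


# Start from the static chain layout used by ASPP: parent[i] = i - 1.
uf = UnionFind(6)
uf.parent = [0, 0, 1, 2, 3, 4]
print(uf.find(5))  # 0
print(uf.parent)   # [0, 0, 0, 0, 0, 0] after path compression
```

After a single `find(5)`, every node on the path points straight at the root. A dynamic-parent extension would exploit this shortcut.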

	Multi-Step Propagation:

	With K=4 evolution steps, information flow becomes:
	```
	Step 1: Position i accesses parent i-1
	Step 2: Position i now has information from i-2 (via i-1)
	Step 3: Position i now has information from i-3 (propagated through chain)
	Step 4: Position i now has information from i-4 (fully propagated)

	Result: Each position has aggregated context from 4 previous positions
	through efficient O(n) operations
	```

	This multi-step propagation is crucial for:
	- Local context: Recent tokens for coherence
	- Gradient flow: Direct paths for backpropagation
	- Efficiency: Linear cost instead of quadratic attention
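The growth of the receptive field over K steps can be checked with a small simulation. The sketch below assumes synchronous updates (every position reads its parent's state from the previous step) and tracks only which input positions can influence each position; `receptive_fields` is an illustrative helper, not part of the model code.

```python
def receptive_fields(seq_len, num_steps):
    """Return, per position, the set of input positions that can influence it."""
    parents = [max(i - 1, 0) for i in range(seq_len)]
    fields = [{i} for i in range(seq_len)]
    for _ in range(num_steps):
        # Synchronous update: merge the parent's field from the previous step.
        fields = [fields[i] | fields[parents[i]] for i in range(seq_len)]
    return fields


fields = receptive_fields(seq_len=8, num_steps=4)
print(sorted(fields[6]))  # [2, 3, 4, 5, 6]
print(sorted(fields[2]))  # [0, 1, 2]
```

Position 6 ends up with context from positions 2 through 6, matching the [i-K, i] window for K=4, while positions near the start are limited by the root.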

	Fusion mechanism:
	```
	aspp_out = ASPP(hidden_states) # Union-Find graph propagation (O(n))
	attn_out = Attention(hidden_states, mask, ...) # Global attention (O(n²))
	gate = sigmoid(linear([aspp_out || attn_out]))
	fused = gate * aspp_out + (1 - gate) * attn_out

	# Combines:
	# - Local structured reasoning (ASPP via Union-Find)
	# - Global contextual awareness (Attention)
	```
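A minimal numeric sketch of the gated fusion for a single token follows. It assumes a scalar gate per token and uses plain Python lists; whether the real gate is scalar or per-dimension is not stated here, and `gated_fusion`, `weights`, and `bias` are illustrative names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(aspp_out, attn_out, weights, bias=0.0):
    """Blend two branch outputs for one token with a scalar gate in (0, 1)."""
    concat = aspp_out + attn_out  # list concatenation plays the role of [aspp || attn]
    gate = sigmoid(sum(w * x for w, x in zip(weights, concat)) + bias)
    fused = [gate * a + (1.0 - gate) * b for a, b in zip(aspp_out, attn_out)]
    return gate, fused


aspp_out = [1.0, 0.0]
attn_out = [0.0, 1.0]

# Zero weights give a neutral gate of 0.5, so both branches contribute equally.
gate, fused = gated_fusion(aspp_out, attn_out, weights=[0.0] * 4)
print(gate, fused)  # 0.5 [0.5, 0.5]

# A large positive bias pushes the gate toward 1, favoring the ASPP branch.
gate_hi, fused_hi = gated_fusion(aspp_out, attn_out, weights=[0.0] * 4, bias=10.0)
print(round(gate_hi, 4))
```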

	### 2. π-Flow Refinement (Per-Layer)

	```python
	# Added to each hybrid layer
	self.pi_flow_aspp = ASPPOperator(...) # Velocity field network
	self.pi_flow_scale = Parameter(0.2) # Learnable flow strength
	self.pi_flow_gate = MLP(hidden_size -> 1) # Token-wise adaptive gating
	```

	π-Flow forward pass:
	```
	function π_flow_refinement(hidden_states):
	    for step = 1 to π_flow_steps:
	        # Compute velocity field using dedicated ASPP
	        v = pi_flow_aspp(hidden_states)

	        # Adaptive per-token gating
	        gate = sigmoid(pi_flow_gate(hidden_states)) # [B, L, 1]
	        alpha = pi_flow_scale * gate

	        # Euler step in probability space
	        hidden_states = hidden_states + alpha * v

	    return hidden_states
	```

	Key design choices:
	1. Per-layer π-flow: Each of 30 layers has independent π-flow parameters
	2. Learnable scale: `pi_flow_scale` adapts flow strength during training
	3. Token-wise gating: Different tokens get different flow magnitudes
	4. ASPP velocity: Reuses ASPP architecture for computing v(h)
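The refinement loop can be exercised on a toy example without any model weights. In the sketch below the velocity field and gate are hypothetical stand-ins (a pull toward zero and a constant gate of 0.5), chosen only so the result is easy to verify by hand; the real model uses a dedicated ASPP operator and a learned gate.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pi_flow_refine(h, velocity, gate_logit, scale=0.2, steps=2):
    """Apply `steps` gated Euler updates h <- h + alpha * v(h) to one token vector."""
    for _ in range(steps):
        v = velocity(h)
        alpha = scale * sigmoid(gate_logit(h))  # per-token step size
        h = [x + alpha * vx for x, vx in zip(h, v)]
    return h


def toy_velocity(h):
    """Hypothetical velocity field that pulls the state toward zero."""
    return [-x for x in h]


def toy_gate_logit(h):
    """Hypothetical gate that always outputs sigmoid(0) = 0.5."""
    return 0.0


refined = pi_flow_refine([1.0, -2.0], toy_velocity, toy_gate_logit, scale=0.2, steps=2)
print(refined)  # approximately [0.81, -1.62]
```

With `scale=0.2` and a gate of 0.5, each step uses α = 0.1, so two steps shrink the state by a factor of 0.9² = 0.81.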

	### 3. Complete Layer Pseudocode

	```
	function HybridLayerWithPiFlow(hidden_states, attention_mask, ...):
	    residual = hidden_states
	    hidden_states = input_layernorm(hidden_states)

	    # === Hybrid ASPP-Attention (Base Asterisk) ===
	    aspp_output = aspp_operator(hidden_states)
	    attn_output = self_attention(hidden_states, attention_mask, ...)

	    # Gated fusion
	    fusion_input = concat([aspp_output, attn_output])
	    gate = sigmoid(linear(dropout(fusion_input)))
	    fused_output = gate * aspp_output + (1 - gate) * attn_output

	    # Residual connection
	    hidden_states = residual + fused_output

	    # === π-Flow Refinement (NEW) ===
	    for step in [1..pi_flow_steps]:
	        v = pi_flow_aspp(hidden_states)
	        alpha = pi_flow_scale * sigmoid(pi_flow_gate(hidden_states))
	        hidden_states = hidden_states + alpha * v

	    # === MLP Block ===
	    residual = hidden_states
	    hidden_states = post_attention_layernorm(hidden_states)
	    hidden_states = mlp(hidden_states)
	    hidden_states = residual + hidden_states

	    return hidden_states
	```

	## Parameter Breakdown

	| Component | Parameters | Notes |
	|-----------|------------|-------|
	| Base SmolLM2 | 135.6M | Embeddings, attention, MLP |
	| ASPP Operators | 35.5M | 30 layers × ~1.2M each |
	| π-Flow ASPPs | 2.3M | 30 layers × ~77k each |
	| π-Flow Gates | 0.2M | 30 layers × ~7k each |
	| π-Flow Scales | 30 | 30 learnable scalars |
	| Total | 173.7M | +28% vs base SmolLM2 |

	π-Flow adds only 1.4% more parameters (2.5M) compared to base Asterisk (171.2M) while providing 60 total refinement steps.
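The breakdown can be cross-checked with a few lines of arithmetic. The component values below are the rounded figures from the table, so the sum lands within rounding of the reported 173.7M total rather than matching it exactly.

```python
# Rounded component sizes from the table, in millions of parameters.
components_m = {
    "Base SmolLM2": 135.6,
    "ASPP operators": 35.5,
    "pi-flow ASPPs": 2.3,
    "pi-flow gates": 0.2,
    "pi-flow scales": 30 / 1e6,  # 30 scalars
}

total_m = sum(components_m.values())
pi_flow_m = sum(v for k, v in components_m.items() if k.startswith("pi-flow"))
overhead_vs_smollm2 = (173.7 - 135.6) / 135.6 * 100

print(round(total_m, 1))           # 173.6, within rounding of the reported 173.7
print(round(pi_flow_m, 1))         # 2.5
print(round(overhead_vs_smollm2))  # 28
```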

	## Quick Start

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	# Load model and tokenizer
	model = AutoModelForCausalLM.from_pretrained(
	    "NoesisLab/Asterisk-Pi",
	    trust_remote_code=True,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)
	tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Asterisk-Pi")

	# Generate text
	messages = [{"role": "user", "content": "Explain the waterfall model in software engineering."}]
	inputs = tokenizer.apply_chat_template(
	    messages,
	    add_generation_prompt=True,  # append the assistant turn marker so the model answers
	    return_tensors="pt",
	).to(model.device)

	outputs = model.generate(
	    inputs,
	    max_new_tokens=256,
	    temperature=0.7,
	    do_sample=True,
	)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	## Training Details

	### Training Dataset

	Mixed benchmark dataset for testing true capabilities:

	| Dataset | Ratio | Purpose |
	|---------|-------|---------|
	| GSM8K | 25% | Math reasoning benchmark |
	| HellaSwag | 30% | Commonsense reasoning benchmark |
	| ARC | 20% | Science QA (Easy + Challenge) |
	| OpenHermes | 10% | High-quality long-form responses |
	| Capybara | 15% | Multi-turn conversations |

	Total: ~10,148 training samples
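As a rough guide to what these ratios mean in absolute terms, the sketch below converts them into approximate per-dataset sample counts. The counts are derived from the stated ratios and the ~10,148 total, not read from the actual training data, so treat them as estimates.

```python
mix = {
    "GSM8K": 0.25,
    "HellaSwag": 0.30,
    "ARC": 0.20,
    "OpenHermes": 0.10,
    "Capybara": 0.15,
}
total_samples = 10_148  # approximate total from the model card

# Estimated sample count per dataset, rounded to the nearest integer.
approx_counts = {name: round(total_samples * ratio) for name, ratio in mix.items()}

for name, count in approx_counts.items():
    print(f"{name:>10}: ~{count}")
print(sum(approx_counts.values()))  # 10148
```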

	### Training Configuration

	- Starting Point: Asterisk checkpoint (base ASPP-Attention model)
	- Optimizer: AdamW (lr=5e-4, weight_decay=0.1)
	- Batch Size: 2 per device, gradient accumulation=4 (effective batch=8)
	- Epochs: 2
	- Scheduler: Linear warmup (10% of steps)
	- Mixed Precision: bfloat16
	- Gradient Checkpointing: Enabled
	- Max Grad Norm: 1.0
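These settings imply a fairly short run. The sketch below estimates the number of optimizer steps and warmup steps, assuming a single device and the ~10,148 samples listed above; the device count and the `training_schedule` helper are assumptions, and the trainer's exact step accounting may differ slightly.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, num_devices=1):
    """Estimate optimizer steps and warmup length from the training configuration."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


schedule = training_schedule(
    num_samples=10_148,
    per_device_batch=2,
    grad_accum=4,
    epochs=2,
    warmup_ratio=0.10,
    num_devices=1,  # assumption: single GPU
)
print(schedule)
# {'effective_batch': 8, 'steps_per_epoch': 1269, 'total_steps': 2538, 'warmup_steps': 254}
```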

	### π-Flow Configuration

	```python
	pi_flow = True
	pi_flow_steps = 2 # 2 refinement steps per layer
	pi_flow_scale = 1.0 # Initial flow strength
	pi_flow_use_gate = True # Token-wise adaptive gating
	```

	### ASPP Configuration (Inherited from Base)

	```python
	aspp_hidden_dim = 256 # Internal dimension (vs 576 model hidden_size)
	aspp_num_steps = 4 # Evolution steps for ASPP
	aspp_dropout = 0.2 # Regularization
	hybrid_layer_indices = None # All 30 layers
	```

	## Model Creation from Base Asterisk

	```python
	from AsteriskForCausalLM import AsteriskForCausalLM
	from safetensors.torch import load_file
	import torch

	# Load Asterisk config and inject π-flow parameters
	from AsteriskForCausalLM import AsteriskConfig
	config = AsteriskConfig.from_pretrained("path/to/Asterisk", trust_remote_code=True)

	# Add π-flow configuration
	config.pi_flow = True
	config.pi_flow_steps = 2
	config.pi_flow_scale = 1.0
	config.pi_flow_use_gate = True

	# Create model with π-flow
	model = AsteriskForCausalLM(config)

	# Load pretrained Asterisk weights (strict=False ignores new π-flow params)
	state_dict = load_file("path/to/Asterisk/model.safetensors")
	missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)

	# π-flow parameters are randomly initialized
	print(f"New π-flow parameters: {len(missing_keys)}")

	# Move to device
	model = model.to(dtype=torch.bfloat16, device="cuda")
	```
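When loading with `strict=False`, it is worth confirming that the only missing keys are the new π-flow parameters. The helper below is a hypothetical check; the module names `pi_flow_aspp`, `pi_flow_scale`, and `pi_flow_gate` come from the layer definition, while the full key paths in the example are made up for illustration.

```python
def split_missing_keys(missing_keys,
                       new_names=("pi_flow_aspp", "pi_flow_scale", "pi_flow_gate")):
    """Separate expected new π-flow keys from keys that should have been loaded."""
    expected, suspicious = [], []
    for key in missing_keys:
        if any(name in key for name in new_names):
            expected.append(key)
        else:
            suspicious.append(key)
    return expected, suspicious


# Illustrative key names; real checkpoints may use different paths.
example_missing = [
    "model.layers.0.pi_flow_aspp.message_net.0.weight",
    "model.layers.0.pi_flow_scale",
    "model.layers.0.self_attn.q_proj.weight",
]
expected, suspicious = split_missing_keys(example_missing)
print(len(expected), suspicious)
# 2 ['model.layers.0.self_attn.q_proj.weight']
```

A non-empty `suspicious` list would indicate that some pretrained Asterisk weights were not transferred and would be randomly initialized instead.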

	## Theoretical Background

	### π-Flow: Probability Flow ODE

	Inspired by diffusion model score-based formulations:

	```
	dx/dt = v(x, t) [Continuous probability flow]
	```

	Discretized with Euler method:
	```
	x_{t+1} = x_t + Δt * v(x_t)
	```

	In Asterisk-Pi:
	- `x_t` = hidden states at layer output
	- `v(x_t)` = velocity field from dedicated ASPP
	- `Δt` = learnable `pi_flow_scale * gate(x_t)`
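To see what the Euler discretization does in practice, the sketch below integrates the simple ODE dx/dt = -x from t = 0 to t = 1 and compares the result against the exact solution exp(-1). The ODE is a toy stand-in chosen because its solution is known; it is not the velocity field learned by the model.

```python
import math


def euler_integrate(x0, velocity, dt, num_steps):
    """Repeatedly apply x <- x + dt * v(x)."""
    x = x0
    for _ in range(num_steps):
        x = x + dt * velocity(x)
    return x


def decay(x):
    """Toy velocity field for dx/dt = -x."""
    return -x


exact = math.exp(-1.0)
errors = {}
for n in (2, 20, 200):
    approx = euler_integrate(1.0, decay, dt=1.0 / n, num_steps=n)
    errors[n] = abs(approx - exact)
    print(n, round(approx, 4), round(errors[n], 4))
```

With only 2 steps, the same count as `pi_flow_steps`, the estimate is 0.25 against an exact value of about 0.368, so a two-step π-flow is a coarse approximation of any underlying continuous flow. More steps reduce the error.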

	### Multi-Scale Refinement

	- Layer-level: 30 hybrid layers with ASPP-Attention fusion
	- π-Flow level: 2 steps per layer = 60 total refinement operations
	- ASPP-level: 4 evolution steps within each ASPP = 240 micro-updates

	This creates a hierarchical refinement cascade enabling gradual convergence to high-quality representations.

	### Why π-Flow Helps

	1. Iterative refinement: Multiple passes allow correcting errors
	2. Adaptive flow: Token-wise gating focuses computation where needed
	3. Gradient flow: More direct paths for gradient propagation
	4. Expressiveness: Increases model capacity with minimal parameters

	## Implementation Details

	### Return Type Handling

	Critical for Transformers compatibility:

	```python
	# HybridASPPAttentionLayer.forward() returns tensor only
	def forward(self, hidden_states, *args, **kwargs) -> torch.Tensor:
	    # ... ASPP + Attention + π-flow ...
	    return hidden_states  # ✅ Tensor, not tuple

	# This matches LlamaDecoderLayer API: -> torch.Tensor
	```

	### Gradient Checkpointing Compatibility

	π-Flow is fully compatible with gradient checkpointing:
	- All operations are standard PyTorch ops
	- No custom CUDA kernels
	- Automatic differentiation through flow steps

	### Weight Initialization

	- ASPP parameters: Transferred from base Asterisk
	- π-Flow ASPP: Randomly initialized (Xavier uniform)
	- π-Flow scale: Initialized to 0.2 (conservative)
	- π-Flow gate: Initialized to output ~0.5 (balanced)

	## Files in Checkpoint

	```
	Asterisk-Pi/
	├── AsteriskForCausalLM.py # Model implementation (with π-flow)
	├── config.json # Model configuration
	├── model.safetensors # Model weights
	├── tokenizer.json # Tokenizer
	├── generation_config.json # Generation settings
	└── README.md # This file
	```

	## Differences from Base Asterisk

	| Feature | Asterisk | Asterisk-Pi |
	|---------|----------|-------------|
	| ASPP-Attention | ✅ | ✅ |
	| π-Flow Refinement | ❌ | ✅ (per-layer) |
	| Parameters | 171.2M | 173.7M (+1.4%) |
	| Refinement Steps | 30 (layers) | 60 (30 layers × 2) |
	| Training Dataset | Capybara | Mixed Benchmarks |
	| Complexity | Medium | High |

	## Known Issues & Solutions

	### 1. Return Type Errors

	Issue: `AttributeError: 'tuple' object has no attribute 'dtype'`

	Solution: `HybridASPPAttentionLayer.forward()` must return `torch.Tensor` only, not tuple. This matches the `LlamaDecoderLayer` API in transformers 4.57.6.

	### 2. π-Flow in All Layers vs Final Layer

	Initial approach: π-flow only in final layer (limited expressiveness)

	Current approach: π-flow in all 30 hybrid layers for maximum refinement capability.

	### 3. Training Stability

	π-Flow can cause instability with high learning rates. Use:
	- A learning rate of 5e-4 (the base model used 2e-5)
	- Gradient clipping (max_norm=1.0)
	- Conservative initial flow scale (0.2-1.0)
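Gradient clipping by global norm rescales all gradients together when their combined norm exceeds the threshold. The sketch below shows the idea on a flat list of numbers; it is a simplified illustration, not the PyTorch implementation used in training.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale gradients so their L2 norm is at most max_norm; return (clipped, norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm  # already within the limit
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # approximately [0.6, 0.8]

# Gradients that are already small pass through unchanged.
print(clip_by_global_norm([0.3, 0.4], max_norm=1.0)[0])  # [0.3, 0.4]
```

Clipping preserves the gradient direction and only limits its magnitude, which bounds the size of any single update when the π-flow branch produces a large gradient.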

	## Dependencies

	```bash
	pip install "torch>=2.0.0"
	pip install "transformers>=4.40.0"
	pip install "trl>=0.8.0"
	pip install "datasets>=2.14.0"
	pip install "accelerate>=0.25.0"
	pip install bitsandbytes
	pip install safetensors
	```

	## Citations

	If you use this model, please cite:

	```bibtex
	@misc{asteriskpi2026,
	title={Asterisk-Pi: Probability Flow Refinement for Hybrid ASPP-Attention Models},
	author={NoesisLab},
	year={2026},
	publisher={Hugging Face},
	url={https://huggingface.co/NoesisLab/Asterisk-Pi}
	}
	```

	```bibtex
	@misc{asterisk2026,
	title={Asterisk: Hybrid ASPP-Attention Architecture for Enhanced Language Modeling},
	author={NoesisLab},
	year={2026},
	publisher={Hugging Face},
	url={https://huggingface.co/NoesisLab/Asterisk}
	}
	```

	```bibtex
	@misc{vonwerra2022trl,
	title={{TRL: Transformer Reinforcement Learning}},
	author={Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
	year={2020},
	journal={GitHub repository},
	publisher={GitHub},
	howpublished={\url{https://github.com/huggingface/trl}}
	}
	```

	```bibtex
	@article{allal2024SmolLM2,
	title={SmolLM2 - with great data, comes great performance},
	author={Allal, Loubna Ben and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro},
	year={2024}
	}
	```

	## Related Work

	- Diffusion Models: π-flow inspired by probability flow ODEs in score-based diffusion
	- Neural ODEs: Continuous-depth models with adaptive computation
	- Iterative Refinement: Multi-pass decoding in sequence models

	## Future Directions

	1. Adaptive π-flow steps: Learn number of refinement steps per layer
	2. Higher-order ODE solvers: Replace Euler with RK4 or adaptive schemes
	3. Stochastic π-flow: Add noise injection for exploration
	4. Cross-layer π-flow: Allow information flow between distant layers

	## License

	This model inherits the Apache 2.0 license from SmolLM2-135M-Instruct.

	## Framework Versions

	- TRL: 0.27.0
	- Transformers: 4.57.6
	- PyTorch: 2.8.0+cu128
	- Datasets: 4.5.0
	- Tokenizers: 0.22.2

	## Acknowledgments

	Built on top of:
	- [Asterisk](https://huggingface.co/NoesisLab/Asterisk) - Base ASPP-Attention architecture
	- [SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct) - Foundation model
	- [TRL](https://github.com/huggingface/trl) - Training framework

	Special thanks to the diffusion model community for probability flow ODE insights.