File size: 11,499 Bytes
42d70c2 0655acd 70b9f14 0655acd 42d70c2 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 |
---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- mamba
- mixture-of-experts
- state-space-models
- hybrid-architecture
---
# AdaptiveRiverLM-1B
**A Hybrid Mamba-SSM + Mixture-of-Experts Language Model**
AdaptiveRiverLM (also known as ROE - River of Experts) is an experimental hybrid architecture combining State Space Models (Mamba) with Mixture-of-Experts (MoE) layers for efficient and adaptive language modeling.
## Model Status
β οΈ **Experimental Release** β οΈ
This model is a test model with a single pass through a collection of synthetic mathematics datasets at a budget of 1. Further training runs are to be performed soon should the architecture prove resourceful.
**Extensive tests have not been performed yet.**
## Architecture Overview
<p align="center">
<img src="https://huggingface.co/Alienanthony/ROE_EDU_BASE_Undercooked/resolve/main/ROE_Build.svg?download=true" alt="AdaptiveRiverLM Architecture" width="800px">
</p>
### ROE (River of Experts) / AdaptiveLM Architecture
The main theory behind the ROE architecture is made to target a range of setups while preserving the performance of a model of the same size or higher.
By preserving complexity with multiple layers of experts, we allow for higher performance in both token generation and model prediction.
### Three-Zone Design
```
βββββββββββββββββββββββββββββββββββββββββββ
β Zone 1: Early Mamba (Layers 0-1) β
β β’ Fast sequence processing β
β β’ O(n) complexity β
β β’ d_state=16, expand=2 β
βββββββββββββββββββββββββββββββββββββββββββ€
β Zone 2: MoE Layers (Layers 2-21) β
β β’ Conditional computation β
β β’ MoE Attention: 6 experts β
β β’ MoE FFN: 4 experts β
β β’ Dynamic expert routing β
βββββββββββββββββββββββββββββββββββββββββββ€
β Zone 3: Enhanced Mamba (Layers 22-23) β
β β’ High-capacity refinement β
β β’ d_state=16, expand=4 (2Γ capacity) β
βββββββββββββββββββββββββββββββββββββββββββ
```
### Key Features
- **Hybrid Architecture**: Combines Mamba State Space Models (SSM) with Mixture-of-Experts
- **24 Layers**: 2 early Mamba + 20 MoE + 2 enhanced Mamba
- **~1B Parameters**: Efficient parameter usage through sparse activation
- **Budget-Aware Inference**: Runtime control over active experts (speed vs. quality tradeoff)
- **Stateless Generation**: Simplified deployment without KV cache requirements
### Model Specifications
| Parameter | Value |
|-----------|-------|
| Total Layers | 24 |
| Hidden Size (d_model) | 1024 |
| FFN Hidden Size | 4096 |
| Vocabulary Size | 50,257 |
| Attention Experts | 6 per MoE layer |
| FFN Experts | 4 per MoE layer |
| Mamba d_state | 16 |
| Mamba d_conv | 4 |
| Position Encoding | RoPE (Rotary Position Embedding) |
## Requirements
```bash
pip install torch transformers
pip install mamba-ssm # Required for Mamba layers
```
**Note**: The `mamba-ssm` package is required for the model to function. Without it, Mamba layers will be non-functional.
## Usage
```bash
python inference_tester.py --model_dir /path/to/adaptiveriverlm --interactive
```
## Architecture Details
### Mamba Blocks
Early and late Mamba layers use selective state space models for efficient O(n) sequence processing:
- **Early Layers (0-1)**: Standard capacity (expand=2)
- **Late Layers (22-23)**: Enhanced capacity (expand=4)
### MoE Layers
Middle layers use conditional computation with dynamic expert routing:
**MoE Attention**:
- 6 expert attention heads per layer
- Dynamic top-k selection (typically 4-6 active)
- Per-expert Q/K/V projections with RoPE
- Router network for content-based selection
**MoE FFN**:
- 4 expert feed-forward networks per layer
- Token-level routing with top-k gating
- Straight-through estimator for differentiable routing
- Load balancing to ensure expert utilization
### Training Objectives
The model is trained with multiple auxiliary losses:
- **Load Balancing Loss**: Ensures even expert utilization
- **Router Z-Loss**: Prevents logit magnitude explosion
- **Entropy Regularization**: Encourages diverse expert selection
## Adaptive Expert Selection
One of the key innovations of AdaptiveRiverLM is its ability to maintain strong performance while dynamically adjusting the number of active attention experts. This allows the model to adapt to different computational budgets and deployment scenarios without requiring separate model checkpoints.
### How Expert Scaling Works
The model's MoE attention layers contain **6 expert heads** each, but not all experts need to be active for every input. The router network intelligently selects which experts to activate based on:
1. **Input Content**: Content-based routing determines which experts are most relevant
2. **Budget Ratio**: User-defined parameter controlling the expert activation range (0.0 to 1.0)
### Expert Activation Formulas
The model uses different scaling strategies for attention and FFN experts:
```python
# Attention expert selection (more aggressive scaling)
k_attention = max(1, int(round(top_k * (0.25 + 0.75 * budget_ratio))))
# Example: top_k=6, budget_ratio=0.5 β k_target = 3 experts (50% active)
# Example: top_k=6, budget_ratio=0.8 β k_target = 5 experts (83% active)
# FFN expert selection (conservative scaling)
k_ffn = max(1, int(round(base_top_k * (0.5 + budget_ratio / 2.0))))
# Example: base_top_k=2, budget_ratio=0.5 β k_target = 1-2 experts
# Example: base_top_k=2, budget_ratio=1.0 β k_target = 2 experts
```
**Key Insight**: Attention experts scale more aggressively (25-100%) while FFN experts scale conservatively (50-100%), as attention routing has been found to be more critical for maintaining quality.
### Performance Characteristics by Budget
| Budget Ratio | Active Attn Experts | Active FFN Experts | Relative Speed | Quality Retention | Recommended Use Case |
|--------------|---------------------|--------------------|--------------:|------------------:|----------------------|
| 1.0 (Full) | 6/6 (100%) | 1/4 (25%) | 1.0Γ | 100% | Maximum quality, complex reasoning |
| 0.9 | 5-6/6 (83-100%) | 1/4 (25%) | ~1.1Γ | 95-98% | High-quality production |
| 0.75 | 4-5/6 (67-83%) | 1/4 (25%) | ~1.4Γ | 90-95% | Balanced performance |
| 0.6 | 4/6 (67%) | 1/4 (25%) | ~1.7Γ | 85-90% | Efficient inference |
| 0.5 | 3/6 (50%) | 1/4 (25%) | ~2.0Γ | 80-85% | Fast generation, good quality |
| 0.35 | 2-3/6 (33-50%) | 1/4 (25%) | ~2.3Γ | 70-80% | Speed-optimized |
| 0.25 | 2/6 (33%) | 1/4 (25%) | ~2.5Γ | 60-75% | Minimal mode, basic tasks |
**Important Notes**:
- Quality retention percentages are task-dependent (simple tasks degrade less)
- Speed improvements are approximate and vary by hardware
- The model uses sparse activation, so even at full budget, not all parameters are active
### Why This Matters
**Graceful Degradation**: Unlike traditional transformers that operate at fixed capacity, AdaptiveRiverLM provides smooth quality/speed tradeoffs:
- **Budget = 1.0**: Full model capacity, all 6 attention experts available per layer
- **Budget = 0.5**: 50% fewer active attention parameters while maintaining 80-85% performance
- **Budget = 0.25**: Minimal mode (33% experts) suitable for simple queries or edge deployment
**Preserved Complexity**: Even at lower budgets, the model maintains architectural richness through:
- **Expert Specialization**: Different experts learn complementary skills during training
- **Intelligent Routing**: Most relevant experts are activated first (content-aware selection)
- **Hybrid Design**: Mamba layers provide stable base performance regardless of budget
- **Residual Connections**: Information bypassing when experts aren't needed
**Real-World Benefits**:
- Deploy one checkpoint across different hardware (high-end server to mobile edge)
- Dynamically adjust compute based on query complexity or available resources
- Batch processing with mixed budgets for different priority levels
- Cost optimization: use lower budgets for simple queries, full budget for complex reasoning
- A/B testing: compare quality degradation vs. speed gains for your specific use case
### Technical Implementation Details
The router architecture includes:
- **Temperature-scaled Gating**: Controlled via `gate_temperature=0.7` for smooth probability distributions
- **Straight-Through Estimator (STE)**: Enables differentiable top-k selection during training
- **Auxiliary Losses**:
- Load Balancing Loss: `((usage - uniform)Β² ).sum()` prevents expert collapse
- Router Z-Loss: `(logitsΒ²).mean()` prevents magnitude explosion
- Entropy Regularization: Encourages diverse expert utilization
- **Top-k Masking**: Hard selection with soft backpropagation via STE
### Performance Validation
**Expected Behavior**:
- Tasks requiring broad knowledge benefit more from high budgets (budget β₯ 0.7)
- Narrow, specialized tasks show minimal degradation even at budget = 0.5
- Simple pattern matching (arithmetic, templates) works well at budget β₯ 0.3
- The Mamba layers (zones 1 and 3) provide stable performance regardless of MoE budget
**Recommended Testing**:
```python
# Benchmark across budgets for your specific use case
for budget in [1.0, 0.75, 0.5, 0.25]:
results = evaluate_model(test_set, budget_ratio=budget)
print(f"Budget {budget}: Accuracy={results.accuracy}, Speed={results.tokens_per_sec}")
```
This adaptive mechanism is what allows AdaptiveRiverLM to maintain strong performance even when constrained to fewer active experts, making it particularly suitable for production deployments with varying resource availability, cost constraints, or latency requirements.
## Training Data
This initial release was trained on a collection of synthetic mathematics datasets, focusing on:
- Arithmetic reasoning
- Algebraic problem solving
- Mathematical expressions
- Step-by-step solutions
## Performance Notes
**Current Status**: This is an experimental architecture test. Comprehensive benchmarking is ongoing.
## Limitations
- **Experimental**: Architecture and training are still being refined
- **Limited Training**: Single-pass on synthetic math data only
- **No KV Cache**: Current implementation uses stateless generation
- **Untested at Scale**: Performance characteristics under investigation
## Roadmap
- [ ] Complete comprehensive benchmark suite
- [ ] Multi-pass training with diverse datasets
- [ ] Implement efficient KV caching for MoE attention
- [ ] Optimize expert routing strategies
- [ ] Scale to larger model sizes (3B, 7B variants)
## License and Use
This model is licensed under Apache 2.0. It is intended for research and educational use.
**Note**: This is an experimental model release. Results may vary significantly from production-ready models. Use with appropriate caution and validation.
## Acknowledgments
- **Mamba**: State Space Model architecture from [Mamba paper](https://arxiv.org/abs/2312.00752)
- **Mixture-of-Experts**: Inspired by modern MoE architectures
- Built with PyTorch and Hugging Face Transformers
--- |