File size: 11,499 Bytes

---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- mamba
- mixture-of-experts
- state-space-models
- hybrid-architecture
---

# AdaptiveRiverLM-1B

**A Hybrid Mamba-SSM + Mixture-of-Experts Language Model**

AdaptiveRiverLM (also known as ROE - River of Experts) is an experimental hybrid architecture combining State Space Models (Mamba) with Mixture-of-Experts (MoE) layers for efficient and adaptive language modeling.

## Model Status

⚠️ **Experimental Release** ⚠️

This model is a test model with a single pass through a collection of synthetic mathematics datasets at a budget of 1. Further training runs are to be performed soon should the architecture prove resourceful.

**Extensive tests have not been performed yet.**

## Architecture Overview

<p align="center">
  <img src="https://huggingface.co/Alienanthony/ROE_EDU_BASE_Undercooked/resolve/main/ROE_Build.svg?download=true" alt="AdaptiveRiverLM Architecture" width="800px">
</p>

### ROE (River of Experts) / AdaptiveLM Architecture

The main theory behind the ROE architecture is made to target a range of setups while preserving the performance of a model of the same size or higher.

By preserving complexity with multiple layers of experts, we allow for higher performance in both token generation and model prediction.

### Three-Zone Design

```
┌─────────────────────────────────────────┐
│  Zone 1: Early Mamba (Layers 0-1)       │
│  • Fast sequence processing             │
│  • O(n) complexity                      │
│  • d_state=16, expand=2                 │
├─────────────────────────────────────────┤
│  Zone 2: MoE Layers (Layers 2-21)       │
│  • Conditional computation              │
│  • MoE Attention: 6 experts             │
│  • MoE FFN: 4 experts                   │
│  • Dynamic expert routing               │
├─────────────────────────────────────────┤
│  Zone 3: Enhanced Mamba (Layers 22-23)  │
│  • High-capacity refinement             │
│  • d_state=16, expand=4 (2× capacity)   │
└─────────────────────────────────────────┘
```

### Key Features

- **Hybrid Architecture**: Combines Mamba State Space Models (SSM) with Mixture-of-Experts
- **24 Layers**: 2 early Mamba + 20 MoE + 2 enhanced Mamba
- **~1B Parameters**: Efficient parameter usage through sparse activation
- **Budget-Aware Inference**: Runtime control over active experts (speed vs. quality tradeoff)
- **Stateless Generation**: Simplified deployment without KV cache requirements

### Model Specifications

| Parameter | Value |
|-----------|-------|
| Total Layers | 24 |
| Hidden Size (d_model) | 1024 |
| FFN Hidden Size | 4096 |
| Vocabulary Size | 50,257 |
| Attention Experts | 6 per MoE layer |
| FFN Experts | 4 per MoE layer |
| Mamba d_state | 16 |
| Mamba d_conv | 4 |
| Position Encoding | RoPE (Rotary Position Embedding) |

## Requirements

```bash
pip install torch transformers
pip install mamba-ssm  # Required for Mamba layers
```

**Note**: The `mamba-ssm` package is required for the model to function. Without it, Mamba layers will be non-functional.

## Usage
```bash
python inference_tester.py --model_dir /path/to/adaptiveriverlm --interactive
```

## Architecture Details

### Mamba Blocks

Early and late Mamba layers use selective state space models for efficient O(n) sequence processing:

- **Early Layers (0-1)**: Standard capacity (expand=2)
- **Late Layers (22-23)**: Enhanced capacity (expand=4)

### MoE Layers

Middle layers use conditional computation with dynamic expert routing:

**MoE Attention**:
- 6 expert attention heads per layer
- Dynamic top-k selection (typically 4-6 active)
- Per-expert Q/K/V projections with RoPE
- Router network for content-based selection

**MoE FFN**:
- 4 expert feed-forward networks per layer
- Token-level routing with top-k gating
- Straight-through estimator for differentiable routing
- Load balancing to ensure expert utilization

### Training Objectives

The model is trained with multiple auxiliary losses:
- **Load Balancing Loss**: Ensures even expert utilization
- **Router Z-Loss**: Prevents logit magnitude explosion
- **Entropy Regularization**: Encourages diverse expert selection

## Adaptive Expert Selection

One of the key innovations of AdaptiveRiverLM is its ability to maintain strong performance while dynamically adjusting the number of active attention experts. This allows the model to adapt to different computational budgets and deployment scenarios without requiring separate model checkpoints.

### How Expert Scaling Works

The model's MoE attention layers contain **6 expert heads** each, but not all experts need to be active for every input. The router network intelligently selects which experts to activate based on:

1. **Input Content**: Content-based routing determines which experts are most relevant
2. **Budget Ratio**: User-defined parameter controlling the expert activation range (0.0 to 1.0)

### Expert Activation Formulas

The model uses different scaling strategies for attention and FFN experts:

```python
# Attention expert selection (more aggressive scaling)
k_attention = max(1, int(round(top_k * (0.25 + 0.75 * budget_ratio))))
# Example: top_k=6, budget_ratio=0.5 → k_target = 3 experts (50% active)
# Example: top_k=6, budget_ratio=0.8 → k_target = 5 experts (83% active)

# FFN expert selection (conservative scaling)
k_ffn = max(1, int(round(base_top_k * (0.5 + budget_ratio / 2.0))))
# Example: base_top_k=2, budget_ratio=0.5 → k_target = 1-2 experts
# Example: base_top_k=2, budget_ratio=1.0 → k_target = 2 experts
```

**Key Insight**: Attention experts scale more aggressively (25-100%) while FFN experts scale conservatively (50-100%), as attention routing has been found to be more critical for maintaining quality.

### Performance Characteristics by Budget

| Budget Ratio | Active Attn Experts | Active FFN Experts | Relative Speed | Quality Retention | Recommended Use Case |
|--------------|---------------------|--------------------|--------------:|------------------:|----------------------|
| 1.0 (Full) | 6/6 (100%) | 1/4 (25%) | 1.0× | 100% | Maximum quality, complex reasoning |
| 0.9 | 5-6/6 (83-100%) | 1/4 (25%) | ~1.1× | 95-98% | High-quality production |
| 0.75 | 4-5/6 (67-83%) | 1/4 (25%) | ~1.4× | 90-95% | Balanced performance |
| 0.6 | 4/6 (67%) | 1/4 (25%) | ~1.7× | 85-90% | Efficient inference |
| 0.5 | 3/6 (50%) | 1/4 (25%) | ~2.0× | 80-85% | Fast generation, good quality |
| 0.35 | 2-3/6 (33-50%) | 1/4 (25%) | ~2.3× | 70-80% | Speed-optimized |
| 0.25 | 2/6 (33%) | 1/4 (25%) | ~2.5× | 60-75% | Minimal mode, basic tasks |

**Important Notes**:
- Quality retention percentages are task-dependent (simple tasks degrade less)
- Speed improvements are approximate and vary by hardware
- The model uses sparse activation, so even at full budget, not all parameters are active

### Why This Matters

**Graceful Degradation**: Unlike traditional transformers that operate at fixed capacity, AdaptiveRiverLM provides smooth quality/speed tradeoffs:

- **Budget = 1.0**: Full model capacity, all 6 attention experts available per layer
- **Budget = 0.5**: 50% fewer active attention parameters while maintaining 80-85% performance
- **Budget = 0.25**: Minimal mode (33% experts) suitable for simple queries or edge deployment

**Preserved Complexity**: Even at lower budgets, the model maintains architectural richness through:
- **Expert Specialization**: Different experts learn complementary skills during training
- **Intelligent Routing**: Most relevant experts are activated first (content-aware selection)
- **Hybrid Design**: Mamba layers provide stable base performance regardless of budget
- **Residual Connections**: Information bypassing when experts aren't needed

**Real-World Benefits**:
- Deploy one checkpoint across different hardware (high-end server to mobile edge)
- Dynamically adjust compute based on query complexity or available resources
- Batch processing with mixed budgets for different priority levels
- Cost optimization: use lower budgets for simple queries, full budget for complex reasoning
- A/B testing: compare quality degradation vs. speed gains for your specific use case

### Technical Implementation Details

The router architecture includes:
- **Temperature-scaled Gating**: Controlled via `gate_temperature=0.7` for smooth probability distributions
- **Straight-Through Estimator (STE)**: Enables differentiable top-k selection during training
- **Auxiliary Losses**:
  - Load Balancing Loss: `((usage - uniform)² ).sum()` prevents expert collapse
  - Router Z-Loss: `(logits²).mean()` prevents magnitude explosion
  - Entropy Regularization: Encourages diverse expert utilization
- **Top-k Masking**: Hard selection with soft backpropagation via STE

### Performance Validation

**Expected Behavior**:
- Tasks requiring broad knowledge benefit more from high budgets (budget ≥ 0.7)
- Narrow, specialized tasks show minimal degradation even at budget = 0.5
- Simple pattern matching (arithmetic, templates) works well at budget ≥ 0.3
- The Mamba layers (zones 1 and 3) provide stable performance regardless of MoE budget

**Recommended Testing**:
```python
# Benchmark across budgets for your specific use case
for budget in [1.0, 0.75, 0.5, 0.25]:
    results = evaluate_model(test_set, budget_ratio=budget)
    print(f"Budget {budget}: Accuracy={results.accuracy}, Speed={results.tokens_per_sec}")
```

This adaptive mechanism is what allows AdaptiveRiverLM to maintain strong performance even when constrained to fewer active experts, making it particularly suitable for production deployments with varying resource availability, cost constraints, or latency requirements.

## Training Data

This initial release was trained on a collection of synthetic mathematics datasets, focusing on:
- Arithmetic reasoning
- Algebraic problem solving
- Mathematical expressions
- Step-by-step solutions

## Performance Notes

**Current Status**: This is an experimental architecture test. Comprehensive benchmarking is ongoing.

## Limitations

- **Experimental**: Architecture and training are still being refined
- **Limited Training**: Single-pass on synthetic math data only
- **No KV Cache**: Current implementation uses stateless generation
- **Untested at Scale**: Performance characteristics under investigation

## Roadmap

- [ ] Complete comprehensive benchmark suite
- [ ] Multi-pass training with diverse datasets
- [ ] Implement efficient KV caching for MoE attention
- [ ] Optimize expert routing strategies
- [ ] Scale to larger model sizes (3B, 7B variants)

## License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use.

**Note**: This is an experimental model release. Results may vary significantly from production-ready models. Use with appropriate caution and validation.

## Acknowledgments

- **Mamba**: State Space Model architecture from [Mamba paper](https://arxiv.org/abs/2312.00752)
- **Mixture-of-Experts**: Inspired by modern MoE architectures
- Built with PyTorch and Hugging Face Transformers

---