---
base_model: t5-small
license: apache-2.0
datasets:
- open-web-math/open-web-math
tags:
- text-generation
- causal-lm
- mamba
- hrm
- pytorch
language:
- en
pipeline_tag: text-generation
---

# CMBA-768M-OpenWebMath

A 768M-parameter Hierarchical Recurrent Memory (HRM) language model trained on high-quality mathematical web text from OpenWebMath. The model uses **Mamba2 state-space models** in place of traditional attention, enabling efficient long-range sequence modeling.

## Model Architecture

**CMBA** (Causal Mamba-based Architecture) implements a hierarchical processing structure, sketched in code after the list below:

- **Hierarchical Design**: Dual-level processing with H-layers (high-level abstraction) and L-layers (low-level specialists)
- **Mamba2 Mixers**: State-space models replace attention, reducing complexity from O(n²) to O(n)
- **Adaptive Computation**: An ACT-style halting mechanism allows variable compute per token
- **Parameters**: ~768M total
- **Context Length**: 1024 tokens
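
The checkpoint itself is defined by the repository's own `HRMText1` class, whose internals are not reproduced on this card. Purely as a hedged illustration of the ideas above, the sketch below wires dual-level (H/L) blocks around an ACT-style halting head in plain PyTorch. `SimpleMixer`, `HRMSketch`, and every default value here are illustrative stand-ins (a real implementation would use actual Mamba2 mixers), not the model's real code.

```python
# Illustrative sketch only; NOT the actual HRMText1 implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMixer(nn.Module):
    """Stand-in for a Mamba2 mixer: any causal O(n) sequence mixer could sit here."""

    def __init__(self, d_model: int):
        super().__init__()
        # Causal depthwise conv as a cheap placeholder for the state-space scan.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=4, padding=3, groups=d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        y = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.proj(F.silu(y))


class HRMSketch(nn.Module):
    """Dual-level stack: L-layers handle low-level mixing, H-layers refine on top,
    and a halting head decides (ACT-style) whether another pondering pass is needed."""

    def __init__(self, d_model: int = 1024, n_l: int = 2, n_h: int = 2, max_halt_steps: int = 1):
        super().__init__()
        self.l_layers = nn.ModuleList([SimpleMixer(d_model) for _ in range(n_l)])
        self.h_layers = nn.ModuleList([SimpleMixer(d_model) for _ in range(n_h)])
        self.halt_head = nn.Linear(d_model, 1)
        self.max_halt_steps = max_halt_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.max_halt_steps):
            for layer in self.l_layers:                  # low-level specialists
                x = x + layer(x)
            for layer in self.h_layers:                  # high-level abstraction
                x = x + layer(x)
            p_halt = torch.sigmoid(self.halt_head(x))    # per-token halting probability
            if bool((p_halt > 0.5).all()):               # stop pondering once all tokens halt
                break
        return x


if __name__ == "__main__":
    out = HRMSketch()(torch.randn(1, 16, 1024))
    print(out.shape)  # torch.Size([1, 16, 1024])
```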

### Configuration

```text
Model Dimensions:
- d_model: 1024
- n_heads: 16 (kept for interface compatibility; not used by the Mamba mixers)
- d_ff: 4096
- H_layers: 12 (high-level hierarchy)
- L_layers: 12 (low-level processing)

Mamba2 Settings:
- d_state: 128
- expand: 2
- headdim: 64
- d_conv: 4
- ngroups: 1

Training:
- Max halt steps: 1
- Block size: 1024
- Batch size: 64 (effective)
- Learning rate: 3e-05 → 1e-06
- Weight decay: 0.1
```
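
The card lists optimizer hyperparameters but not the optimizer class or schedule shape. As one plausible, hedged reading (AdamW with weight decay 0.1 and a cosine decay from 3e-05 to a 1e-06 floor; `model` and `total_steps` below are placeholders), the setup might look like:

```python
# One plausible optimizer/scheduler setup for the listed hyperparameters;
# the actual training script may differ (e.g. warmup, schedule shape, optimizer choice).
import torch

model = torch.nn.Linear(1024, 1024)   # placeholder for the real HRM model
total_steps = 10_000                  # placeholder; not stated on this card

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=1e-6   # decays 3e-05 toward 1e-06
)

for step in range(total_steps):
    # forward/backward on a 1024-token block (effective batch size 64) would go here
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```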

## Training Data

- **Dataset**: [open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math)
- **Tokenizer**: `t5-small` (T5 SentencePiece; see the packing sketch below)
- **Vocab Size**: 32100
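
The exact preprocessing pipeline is not published on this card. As a rough sketch of how the pieces above could fit together, documents can be tokenized with the T5 tokenizer and packed into fixed 1024-token blocks; the streaming and packing details here are assumptions, not the model's actual data code.

```python
# Rough sketch of one way to build 1024-token training blocks from OpenWebMath.
from datasets import load_dataset
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)

BLOCK_SIZE = 1024
buffer, blocks = [], []
for example in ds.take(100):                      # small sample for illustration
    buffer.extend(tokenizer(example["text"]).input_ids)
    while len(buffer) >= BLOCK_SIZE:
        blocks.append(buffer[:BLOCK_SIZE])        # one fixed-length training example
        buffer = buffer[BLOCK_SIZE:]
```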

## Latest Performance (Epoch 0)

- **Validation Loss**: `10.3766`
- **Validation Perplexity**: `32099.98` (see the consistency check below)
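
Perplexity here is exp(validation loss), so the two reported numbers are consistent; both sit near the level of a uniform distribution over the tokenizer's 32100-token vocabulary.

```python
# Quick consistency check: perplexity = exp(mean cross-entropy loss).
import math

val_loss = 10.3766
print(math.exp(val_loss))   # ~32099.6, matching the reported perplexity up to rounding
print(math.log(32100))      # ~10.3766, the loss of a uniform guess over 32100 tokens
```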

## Usage

```python
from transformers import T5Tokenizer

# HRMText1 is the custom model class for this checkpoint (not part of transformers)
from hrm_text1_mamba1_donor import HRMText1

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = HRMText1.from_pretrained("Viharikvs/CMBA-768M-OpenWebMath")

# Generate text
input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
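
If `HRMText1` follows the standard `transformers` generation interface (an assumption; the snippet above only shows default decoding), sampling parameters can be passed in the usual way:

```python
# Sampled decoding; assumes HRMText1 supports the standard GenerationMixin arguments.
outputs = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```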

## Citation

If you use this model, please cite:

```bibtex
@misc{cmba-768m-openwebmath,
  author    = {Vihari},
  title     = {CMBA-768M-OpenWebMath: Hierarchical Mamba-based Language Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Viharikvs/CMBA-768M-OpenWebMath}
}
```

## License

Apache 2.0