---
license: cc-by-nc-4.0
language:
- en
- fr
- code
tags:
- complexity
- token-routed-mlp
- flash-attention
- causal-lm
library_name: transformers
pipeline_tag: text-generation
---

# Complexity Base

A Llama-style transformer with architectural improvements for efficiency and performance.

## Architecture: Llama + Improvements

Complexity builds on the Llama architecture with three key enhancements:

| Component | Llama | Complexity |
|-----------|-------|------------|
| **MLP** | Dense FFN | **Token-Routed MLP** (4 experts, 1 active) |
| **Attention** | Standard | **Flash Attention** via SDPA |
| **Normalization** | RMSNorm only | RMSNorm + **QK Normalization** |
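
The attention entry refers to PyTorch's `scaled_dot_product_attention` (SDPA), which dispatches to a Flash Attention kernel when one is available. A minimal illustration of the call (shapes here are arbitrary; this is not the repository's code):

```python
import torch
import torch.nn.functional as F

# q, k, v: (batch, heads, seq, head_dim); shapes are illustrative only
q = torch.randn(1, 12, 16, 64)
k = torch.randn(1, 12, 16, 64)
v = torch.randn(1, 12, 16, 64)

# SDPA selects the fastest available backend (Flash Attention on
# supported GPUs) and applies a causal mask for autoregressive decoding.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```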

### Token-Routed MLP

Unlike standard MoE, which routes tokens based on their hidden states, the Token-Routed MLP routes based on the **token ID**:

```python
expert_idx = token_id % num_experts  # deterministic routing, no learned router
output = experts[expert_idx](hidden_states)
```

**Benefits:**

- No router network overhead
- Deterministic, reproducible routing
- 4x total MLP capacity at the same per-token compute (only 1 of 4 experts active)
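
As a sketch, this routing could be implemented as a drop-in FFN replacement along the following lines (the class name `TokenRoutedMLP`, the two-layer expert shape, and the intermediate size are illustrative assumptions, not the repository's actual implementation):

```python
import torch
import torch.nn as nn

class TokenRoutedMLP(nn.Module):
    """Sketch: each token is dispatched to one of `num_experts`
    feed-forward networks based solely on its token ID."""

    def __init__(self, hidden_size=768, intermediate_size=2048, num_experts=4):
        super().__init__()
        self.num_experts = num_experts
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, hidden_states, input_ids):
        # hidden_states: (batch, seq, hidden); input_ids: (batch, seq)
        expert_idx = input_ids % self.num_experts  # deterministic, router-free
        output = torch.zeros_like(hidden_states)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i  # positions assigned to expert i
            if mask.any():
                output[mask] = expert(hidden_states[mask])
        return output
```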

### QK Normalization

Stabilizes attention at scale by normalizing Q and K before computing attention scores:

```python
q = self.q_norm(q)  # normalize query vectors per head
k = self.k_norm(k)  # normalize key vectors per head
attn = (q @ k.transpose(-2, -1)) / sqrt(d)
```
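
A self-contained sketch of the idea (tensor shapes are illustrative, the GQA expansion of the 4 KV heads is omitted, and `nn.RMSNorm` assumes PyTorch >= 2.4):

```python
import math
import torch
import torch.nn as nn

# Illustrative shapes: 12 heads, head_dim = 768 / 12 = 64
batch, heads, seq, head_dim = 1, 12, 16, 64
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)

# Normalizing each head's query/key vectors bounds the scale
# of the attention logits, which stabilizes training.
q_norm = nn.RMSNorm(head_dim)
k_norm = nn.RMSNorm(head_dim)

scores = (q_norm(q) @ k_norm(k).transpose(-2, -1)) / math.sqrt(head_dim)
attn = scores.softmax(dim=-1)
```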

## Model Details

- **Parameters**: ~100M
- **Hidden size**: 768
- **Layers**: 12
- **Attention heads**: 12 (KV heads: 4)
- **Experts**: 4 (1 active per token)
- **Vocabulary**: 100K tokens
- **Context**: 2048 tokens
- **Training steps**: 10,000

## Installation

```bash
pip install complexity-model pyllm-inference
```

## Usage

### With PyLLM

```bash
pyllm serve Pacific-Prime/complexity
```

### Python API

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Pacific-Prime/complexity")
model = AutoModelForCausalLM.from_pretrained(
    "Pacific-Prime/complexity",
    trust_remote_code=True,  # custom architecture is loaded from the repo
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```

## Comparison with Llama

```
Llama:      embed -> [Attn + FFN]            x L -> output
Complexity: embed -> [Attn + TokenRoutedMLP] x L -> output
                      ↑ QK Norm  ↑ 4 experts (1 active)
```

Same active parameter count per token, but:

- **4x more total MLP parameters** (distributed across experts)
- **Faster training** (QK norm stabilizes gradients)
- **Better scaling** (sparse activation)

## License

CC BY-NC 4.0

## Links

- [GitHub](https://github.com/Complexity-ML/complexity-framework)
- [PyPI](https://pypi.org/project/complexity-framework/)

## Citation

```bibtex
@misc{complexity,
  title={Complexity: Token-Routed MLP Transformer},
  author={Pacific Prime},
  year={2025},
  url={https://huggingface.co/Pacific-Prime/complexity}
}
```