---
license: cc-by-nc-4.0
language:
- en
- fr
- code
tags:
- complexity
- token-routed-mlp
- flash-attention
- causal-lm
library_name: transformers
pipeline_tag: text-generation
---

# Complexity Base

A Llama-style transformer with architectural improvements for efficiency and performance.

## Architecture: Llama + Improvements

Complexity builds on the Llama architecture with three key enhancements:

| Component | Llama | Complexity |
|-----------|-------|------------|
| **MLP** | Dense FFN | **Token-Routed MLP** (4 experts, 1 active) |
| **Attention** | Standard | **Flash Attention** via SDPA |
| **Normalization** | RMSNorm only | RMSNorm + **QK Normalization** |

### Token-Routed MLP

Unlike standard MoE, which routes tokens through a learned router over hidden states, the Token-Routed MLP selects an expert deterministically from the **token ID**:

```python
expert_idx = token_id % num_experts  # Deterministic routing
output = experts[expert_idx](hidden_states)
```

**Benefits:**
- No router network overhead
- Deterministic, reproducible routing
- 4x total MLP capacity at dense-FFN compute cost (only 1 of 4 experts runs per token)
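The routing rule above can be sketched end to end in a few lines. This is an illustrative toy, not the actual Complexity implementation: the "experts" here are stand-in functions rather than learned FFN blocks.

```python
# Minimal sketch of deterministic token-ID routing (illustrative only).
NUM_EXPERTS = 4

def route(token_id: int) -> int:
    """Pick an expert purely from the token ID -- no learned router."""
    return token_id % NUM_EXPERTS

def token_routed_mlp(token_ids, hidden_states, experts):
    """Apply exactly one expert per token, chosen by route()."""
    return [experts[route(t)](h) for t, h in zip(token_ids, hidden_states)]

# Toy experts: each scales its input by a different constant.
experts = [lambda h, s=s: h * s for s in (1.0, 2.0, 3.0, 4.0)]

token_ids = [0, 1, 5, 6]            # 5 % 4 == 1, 6 % 4 == 2
hidden = [1.0, 1.0, 1.0, 1.0]
print(token_routed_mlp(token_ids, hidden, experts))  # [1.0, 2.0, 2.0, 3.0]
```

Because the mapping is a pure function of the token ID, the same token always hits the same expert across runs and devices, which is what makes the routing reproducible.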

### QK Normalization

Stabilizes attention at scale by normalizing Q and K before computing attention scores:

```python
q = self.q_norm(q)  # RMSNorm over the head dimension
k = self.k_norm(k)
attn = (q @ k.T) / sqrt(d)  # normalized Q/K keep these logits bounded
```
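A runnable numpy sketch of the same idea, under stated assumptions: RMSNorm over the head dimension with a small epsilon, a single head, and no causal mask. The card does not specify the exact norm placement or hyperparameters, so treat this as illustrative only.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMS-normalize along the last (head) dimension."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    """Scaled dot-product attention with Q and K RMS-normalized first.

    Shapes: (seq, d_head). Single head, no mask -- a sketch, not the
    model's exact code.
    """
    d = q.shape[-1]
    q, k = rms_norm(q), rms_norm(k)
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

After normalization every row of `q` and `k` has unit RMS, so the attention logits cannot blow up as activations grow during training; that bound is the source of the stability claim.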

## Model Details

- **Parameters**: ~100M
- **Hidden size**: 768
- **Layers**: 12
- **Attention heads**: 12 (KV heads: 4)
- **Experts**: 4 (1 active per token)
- **Vocabulary**: 100K tokens
- **Context**: 2048 tokens
- **Training steps**: 10,000

## Installation

```bash
pip install complexity-model pyllm-inference
```

## Usage

### With PyLLM

```bash
pyllm serve Pacific-Prime/complexity-tiny
```

### Python API

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Pacific-Prime/complexity")
model = AutoModelForCausalLM.from_pretrained(
    "Pacific-Prime/complexity",
    trust_remote_code=True
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```

## Comparison with Llama

```
Llama:      embed -> [Attn + FFN] x L -> output
Complexity: embed -> [Attn + TokenRoutedMLP] x L -> output
                      ↑ QK Norm    ↑ 4 experts (1 active)
```

Same *active* parameter count per token as a dense Llama of this size, but:
- **4x more total MLP parameters** (distributed across experts)
- **Faster training** (QK norm stabilizes gradients)
- **Better scaling** (sparse activation)
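To make the "same active, 4x total" claim concrete, here is the rough MLP parameter arithmetic. The intermediate (FFN) size is not stated in this card, so `2048` below is a hypothetical placeholder; only the 4:1 ratio matters.

```python
# Rough MLP parameter arithmetic (intermediate size is a placeholder).
hidden, intermediate, num_experts, layers = 768, 2048, 4, 12

# SwiGLU-style FFN: gate + up + down projections (Llama convention).
per_expert = 3 * hidden * intermediate
total_mlp = layers * num_experts * per_expert   # parameters stored
active_mlp = layers * 1 * per_expert            # parameters used per token

print(total_mlp // active_mlp)  # 4
```

Storage grows with the number of experts, but per-token FLOPs stay at the single-expert (dense FFN) level, which is the sparse-activation benefit listed above.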

## License

CC BY-NC 4.0 (per the `license` field in the metadata above)

## Links

- [GitHub](https://github.com/Complexity-ML/complexity-framework)
- [PyPI](https://pypi.org/project/complexity-framework/)

## Citation

```bibtex
@misc{complexity,
  title={Complexity: Token-Routed MLP Transformer},
  author={Pacific Prime},
  year={2025},
  url={https://huggingface.co/Pacific-Prime/complexity}
}
```