Pacific-Prime committed (verified) · Commit c8a112a · 1 parent: cd861b4

Update README.md

Files changed (1): README.md (+121 −121)
---
license: cc-by-nc-4.0
language:
- en
- fr
- code
tags:
- complexity
- token-routed-mlp
- flash-attention
- causal-lm
library_name: transformers
pipeline_tag: text-generation
---

# Complexity Base

A Llama-style transformer with architectural improvements for efficiency and performance.

## Architecture: Llama + Improvements

Complexity builds on the Llama architecture with three key enhancements:

| Component | Llama | Complexity |
|-----------|-------|------------|
| **MLP** | Dense FFN | **Token-Routed MLP** (4 experts, 1 active) |
| **Attention** | Standard | **Flash Attention** via SDPA |
| **Normalization** | RMSNorm only | RMSNorm + **QK Normalization** |

### Token-Routed MLP

Unlike standard MoE, which routes tokens based on their hidden states, the Token-Routed MLP routes them deterministically by **token ID**:

```python
expert_idx = token_id % num_experts  # Deterministic routing
output = experts[expert_idx](hidden_states)
```

**Benefits:**
- No router network overhead
- Deterministic, reproducible routing
- 4x parameter efficiency (only 1 of 4 experts active per token)
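
A minimal PyTorch sketch of this routing scheme. The module and layer sizes here are illustrative assumptions, not the model's actual implementation:

```python
import torch
import torch.nn as nn

class TokenRoutedMLP(nn.Module):
    """Sketch of a token-routed MLP: each token is assigned to one expert
    by `token_id % num_experts`, so there is no learned router network."""

    def __init__(self, hidden_size=768, intermediate_size=2048, num_experts=4):
        super().__init__()
        self.num_experts = num_experts
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, hidden_states, token_ids):
        # hidden_states: (batch, seq, hidden); token_ids: (batch, seq)
        expert_idx = token_ids % self.num_experts
        output = torch.zeros_like(hidden_states)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i          # tokens owned by expert i
            if mask.any():
                output[mask] = expert(hidden_states[mask])
        return output
```

Because the routing key is the token ID rather than a learned gate, the same token always hits the same expert, which makes runs reproducible and removes the load-balancing losses a learned router would need.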

### QK Normalization

Stabilizes attention at scale by normalizing Q and K before computing attention scores:

```python
q = self.q_norm(q)
k = self.k_norm(k)
attn = (q @ k.T) / sqrt(d)
```
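
The stabilizing effect can be seen numerically in a small self-contained experiment. This uses plain RMS normalization without the learned per-channel scale, and arbitrary sizes:

```python
import torch

def rms_norm(x, eps=1e-6):
    # RMS-normalize over the head dimension (learned scale omitted in this sketch).
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

torch.manual_seed(0)
d = 64
q = torch.randn(8, d) * 50      # deliberately large activations
k = torch.randn(8, d) * 50

raw = (q @ k.T) / d**0.5        # unnormalized logits can grow without bound
normed = (rms_norm(q) @ rms_norm(k).T) / d**0.5  # bounded by sqrt(d) = 8

print(raw.abs().max(), normed.abs().max())
```

After RMS normalization every row of `q` and `k` has norm at most `sqrt(d)`, so the scaled logits are capped at `sqrt(d)` regardless of activation magnitude, which keeps the softmax out of its saturated regime.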

## Model Details

- **Parameters**: ~100M
- **Hidden size**: 768
- **Layers**: 12
- **Attention heads**: 12 (KV heads: 4)
- **Experts**: 4 (1 active per token)
- **Vocabulary**: 100K tokens
- **Context**: 2048 tokens
- **Training steps**: 10,000
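
As a back-of-the-envelope check on these figures (assuming tied input/output embeddings, which the card does not state), the embedding table alone accounts for most of the parameter budget:

```python
# Embedding table size for a 100K vocabulary at hidden size 768.
vocab_size, hidden_size = 100_000, 768
embedding_params = vocab_size * hidden_size
print(f"{embedding_params / 1e6:.1f}M")  # 76.8M of the ~100M total
```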

## Installation

```bash
pip install complexity-model pyllm-inference
```

## Usage

### With PyLLM

```bash
pyllm serve Pacific-Prime/complexity-tiny
```

### Python API

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Pacific-Prime/complexity")
model = AutoModelForCausalLM.from_pretrained(
    "Pacific-Prime/complexity",
    trust_remote_code=True
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```

## Comparison with Llama

```
Llama:      embed -> [Attn + FFN] x L -> output
Complexity: embed -> [Attn + TokenRoutedMLP] x L -> output
                      ↑ QK Norm  ↑ 4 experts (1 active)
```

Same active parameter count per token, but:
- **4x more total MLP parameters** (distributed across experts)
- **Faster training** (QK norm stabilizes gradients)
- **Better scaling** (sparse activation)
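
The 4x total-vs-active ratio can be made concrete. The intermediate size here is an assumption for illustration, not the model's actual configuration:

```python
# Illustrative MLP parameter count: 4 experts, 1 active per token.
hidden_size, intermediate_size, num_experts = 768, 2048, 4
per_expert = 2 * hidden_size * intermediate_size  # up + down projections
total_mlp = num_experts * per_expert              # parameters stored
active_mlp = per_expert                           # parameters used per token
print(total_mlp // active_mlp)  # 4
```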

## License

CC BY-NC 4.0

## Citation

```bibtex
@misc{complexity,
  title={Complexity: Token-Routed MLP Transformer},
  author={Pacific Prime},
  year={2025},
  url={https://huggingface.co/Pacific-Prime/complexity}
}
```