---
license: cc-by-nc-4.0
language:
- en
- fr
- code
tags:
- complexity
- token-routed-mlp
- flash-attention
- causal-lm
library_name: transformers
pipeline_tag: text-generation
---

# Complexity Base

A Llama-style transformer with architectural improvements for efficiency and performance.

## Architecture: Llama + Improvements

Complexity builds on the Llama architecture with three key enhancements:

| Component | Llama | Complexity |
|-----------|-------|------------|
| **MLP** | Dense FFN | **Token-Routed MLP** (4 experts, 1 active) |
| **Attention** | Standard | **Flash Attention** via SDPA |
| **Normalization** | RMSNorm only | RMSNorm + **QK Normalization** |

### Token-Routed MLP

Unlike standard MoE, which routes based on hidden states through a learned router, the Token-Routed MLP routes each token by its **token ID**:

```python
expert_idx = token_id % num_experts  # deterministic routing, no learned router
output = experts[expert_idx](hidden_states)
```

**Benefits:**
- No router network overhead
- Deterministic, reproducible routing
- 4x the total MLP capacity at constant per-token compute (only 1 of 4 experts active)
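
For concreteness, here is a minimal PyTorch sketch of such a layer. This is a sketch only: module and argument names like `SwiGLUExpert` and `intermediate_size` are assumptions, not the model's shipped remote code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One Llama-style FFN expert (SwiGLU)."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class TokenRoutedMLP(nn.Module):
    """Routes each token to experts[token_id % num_experts]; no learned router."""
    def __init__(self, hidden_size: int, intermediate_size: int, num_experts: int = 4):
        super().__init__()
        self.num_experts = num_experts
        self.experts = nn.ModuleList(
            SwiGLUExpert(hidden_size, intermediate_size) for _ in range(num_experts)
        )

    def forward(self, hidden_states: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden); input_ids: (batch, seq)
        expert_idx = input_ids % self.num_experts           # deterministic assignment
        output = torch.zeros_like(hidden_states)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                          # tokens owned by expert i
            if mask.any():
                output[mask] = expert(hidden_states[mask])  # gather, transform, scatter
        return output
```

Because the assignment depends only on the vocabulary ID, a given token always lands on the same expert, which is what makes the routing free, deterministic, and reproducible.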

### QK Normalization

Stabilizes attention at scale by normalizing Q and K before computing attention scores:

```python
q = self.q_norm(q)  # normalize queries before the score computation
k = self.k_norm(k)  # ...and keys likewise
attn = (q @ k.T) / sqrt(d)  # scores now have bounded scale
```
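
Combining the two attention-side changes from the table above, here is a minimal sketch of QK-normalized grouped-query attention on top of PyTorch SDPA. Class and projection names are assumptions, RoPE and KV caching are omitted, and `nn.RMSNorm` requires PyTorch >= 2.4:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Grouped-query attention with QK RMSNorm, computed via SDPA."""
    def __init__(self, hidden_size: int = 768, num_heads: int = 12, num_kv_heads: int = 4):
        super().__init__()
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, hidden_size, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)  # per-head-dim RMSNorm
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)    # QK norm: bound the logit scale
        rep = self.num_heads // self.num_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        # SDPA dispatches to a Flash-Attention kernel when the inputs allow it
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```

`F.scaled_dot_product_attention` selects a Flash-Attention kernel when one is available, which is how the "Flash Attention via SDPA" row is typically realized.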

## Model Details

- **Parameters**: ~100M
- **Hidden size**: 768
- **Layers**: 12
- **Attention heads**: 12 (KV heads: 4)
- **Experts**: 4 (1 active per token)
- **Vocabulary**: 100K tokens
- **Context**: 2048 tokens
- **Training steps**: 10,000
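
A quick back-of-envelope check of these numbers (assuming tied input/output embeddings, which the card does not state):

```python
# Rough parameter budget implied by the figures above (tied embeddings assumed)
vocab, hidden, layers = 100_000, 768, 12
embed = vocab * hidden                    # 76.8M -- the embedding table dominates
layer_budget = (100e6 - embed) / layers   # ~1.9M per layer for attention + 4 experts
print(f"embeddings: {embed / 1e6:.1f}M, per-layer budget: {layer_budget / 1e6:.2f}M")
```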

## Installation

```bash
pip install complexity-model pyllm-inference
```

## Usage

### With PyLLM

```bash
pyllm serve Pacific-Prime/complexity-tiny
```

### Python API

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Pacific-Prime/complexity")
model = AutoModelForCausalLM.from_pretrained(
    "Pacific-Prime/complexity",
    trust_remote_code=True
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```

## Comparison with Llama

```
Llama:      embed -> [Attn + FFN] x L -> output
Complexity: embed -> [Attn + TokenRoutedMLP] x L -> output
                       ↑ QK Norm  ↑ 4 experts (1 active)
```

Active compute per token matches a dense Llama of the same shape, but:
- **4x more total MLP parameters** (distributed across experts, one active per token)
- **Faster training** (QK norm stabilizes gradients)
- **Better scaling** (sparse activation)

## License

CC BY-NC 4.0

## Citation

```bibtex
@misc{complexity,
  title={Complexity: Token-Routed MLP Transformer},
  author={Pacific Prime},
  year={2025},
  url={https://huggingface.co/Pacific-Prime/complexity}
}
```