---
license: cc-by-nc-4.0
language:
- en
- fr
- code
tags:
- complexity
- token-routed-mlp
- flash-attention
- causal-lm
library_name: transformers
pipeline_tag: text-generation
---

# Complexity Base

A Llama-style transformer with architectural improvements for efficiency and performance.

## Architecture: Llama + Improvements

Complexity builds on the Llama architecture with three key enhancements:

| Component | Llama | Complexity |
|-----------|-------|------------|
| **MLP** | Dense FFN | **Token-Routed MLP** (4 experts, 1 active) |
| **Attention** | Standard | **Flash Attention** via SDPA |
| **Normalization** | RMSNorm only | RMSNorm + **QK Normalization** |
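
The attention entry refers to PyTorch's `scaled_dot_product_attention` (SDPA), which dispatches to a Flash Attention kernel when one is available. A minimal illustration of the call (shapes here are arbitrary; this is not the repository's code):

```python
import torch
import torch.nn.functional as F

# q, k, v: (batch, heads, seq, head_dim); shapes are illustrative only
q = torch.randn(1, 12, 16, 64)
k = torch.randn(1, 12, 16, 64)
v = torch.randn(1, 12, 16, 64)

# SDPA selects the fastest available backend (Flash Attention on
# supported GPUs) and applies a causal mask for autoregressive decoding.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```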

### Token-Routed MLP

Unlike standard MoE, which routes tokens based on their hidden states, the Token-Routed MLP routes based on the **token ID**:

```python
expert_idx = token_id % num_experts  # deterministic routing, no learned router
output = experts[expert_idx](hidden_states)
```

**Benefits:**

- No router network overhead
- Deterministic, reproducible routing
- 4x total MLP capacity at the same per-token compute (only 1 of 4 experts active)
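
As a sketch, this routing could be implemented as a drop-in FFN replacement along the following lines (the class name `TokenRoutedMLP`, the two-layer expert shape, and the intermediate size are illustrative assumptions, not the repository's actual implementation):

```python
import torch
import torch.nn as nn

class TokenRoutedMLP(nn.Module):
    """Sketch: each token is dispatched to one of `num_experts`
    feed-forward networks based solely on its token ID."""

    def __init__(self, hidden_size=768, intermediate_size=2048, num_experts=4):
        super().__init__()
        self.num_experts = num_experts
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, hidden_states, input_ids):
        # hidden_states: (batch, seq, hidden); input_ids: (batch, seq)
        expert_idx = input_ids % self.num_experts  # deterministic, router-free
        output = torch.zeros_like(hidden_states)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i  # positions assigned to expert i
            if mask.any():
                output[mask] = expert(hidden_states[mask])
        return output
```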

### QK Normalization

Stabilizes attention at scale by normalizing Q and K before computing attention scores:

```python
q = self.q_norm(q)  # normalize query vectors per head
k = self.k_norm(k)  # normalize key vectors per head
attn = (q @ k.transpose(-2, -1)) / sqrt(d)
```
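
A self-contained sketch of the idea (tensor shapes are illustrative, the GQA expansion of the 4 KV heads is omitted, and `nn.RMSNorm` assumes PyTorch >= 2.4):

```python
import math
import torch
import torch.nn as nn

# Illustrative shapes: 12 heads, head_dim = 768 / 12 = 64
batch, heads, seq, head_dim = 1, 12, 16, 64
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)

# Normalizing each head's query/key vectors bounds the scale
# of the attention logits, which stabilizes training.
q_norm = nn.RMSNorm(head_dim)
k_norm = nn.RMSNorm(head_dim)

scores = (q_norm(q) @ k_norm(k).transpose(-2, -1)) / math.sqrt(head_dim)
attn = scores.softmax(dim=-1)
```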

## Model Details

- **Parameters**: ~100M
- **Hidden size**: 768
- **Layers**: 12
- **Attention heads**: 12 (KV heads: 4)
- **Experts**: 4 (1 active per token)
- **Vocabulary**: 100K tokens
- **Context**: 2048 tokens
- **Training steps**: 10,000

## Installation

```bash
pip install complexity-model pyllm-inference
```

## Usage

### With PyLLM

```bash
pyllm serve Pacific-Prime/complexity
```

### Python API

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Pacific-Prime/complexity")
model = AutoModelForCausalLM.from_pretrained(
    "Pacific-Prime/complexity",
    trust_remote_code=True,  # custom architecture is loaded from the repo
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```

## Comparison with Llama

```
Llama:      embed -> [Attn + FFN]            x L -> output
Complexity: embed -> [Attn + TokenRoutedMLP] x L -> output
                      ↑ QK Norm  ↑ 4 experts (1 active)
```

Same active parameter count per token, but:

- **4x more total MLP parameters** (distributed across experts)
- **Faster training** (QK norm stabilizes gradients)
- **Better scaling** (sparse activation)

## License

CC BY-NC 4.0

## Links

- [GitHub](https://github.com/Complexity-ML/complexity-framework)
- [PyPI](https://pypi.org/project/complexity-framework/)

## Citation

```bibtex
@misc{complexity,
  title={Complexity: Token-Routed MLP Transformer},
  author={Pacific Prime},
  year={2025},
  url={https://huggingface.co/Pacific-Prime/complexity}
}
```