LisaMegaWatts committed (verified) · Commit fed3ca7 · Parent: 3cd6b53

Add model card with architecture details, provenance, and training metrics

Files changed (1): README.md (+223 lines)

---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- symbiogenesis
- monarch-mixer
- long-convolution
- causal-conv
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
model-index:
- name: SymbioSLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: perplexity
      value: 79.9
      name: Val PPL (step 1000)
---

# SymbioSLM

A ~4M-parameter decoder-only language model using the **Symbiogenesis** architecture, a novel multi-organelle sequence-mixing design inspired by biological endosymbiosis (Margulis, 1967). Implemented entirely in Julia using Lux.jl and trained on classical philosophy texts.

## Architecture

Symbiogenesis replaces softmax attention with three complementary "organelles" per block, fused via a learned per-channel gate:

```
SymbioBlock (x6)
+-- RMSNorm
+-- SymbioSequenceMixer
|   +-- Organelle 1: CausalDepthwiseConv1d (local n-gram patterns, K=4)
|   +-- Organelle 2: Multi-head MonarchMatrix (global sub-quadratic mixing)
|   +-- Organelle 3: LongConv (global dense causal filter)
|   +-- OrganelleGate (per-channel softmax fusion)
+-- RMSNorm
+-- SwiGLU FFN
```
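
Organelle 1's computation can be sketched outside Julia. The following is an illustrative NumPy version of a K=4 causal depthwise convolution (this is not the model's Lux implementation, just the math it computes):

```python
import numpy as np

# Illustrative NumPy sketch of Organelle 1 (causal depthwise conv, K=4):
# one private kernel per channel, with left padding so the output at
# position t only sees inputs t-3..t.
def causal_depthwise_conv(x, kernels):
    # x: (T, C) sequence of embeddings; kernels: (C, K)
    T, C = x.shape
    K = kernels.shape[1]
    xp = np.concatenate([np.zeros((K - 1, C)), x], axis=0)  # pad the past
    out = np.zeros_like(x)
    for t in range(T):
        out[t] = (xp[t:t + K] * kernels.T).sum(axis=0)  # per-channel dot
    return out

# Causality check: an impulse at position 3 never leaks backwards.
x = np.zeros((8, 2)); x[3, 0] = 1.0
y = causal_depthwise_conv(x, np.ones((2, 4)))
print(y[:, 0])   # zero before position 3, nonzero for positions 3..6
```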

### How It Works

1. **CausalConv** captures local bigram/trigram/4-gram patterns via depthwise convolution (one kernel per channel, length 4).

2. **Monarch matrices** provide global sequence mixing through the factorization M = P^T * BlockDiag(L1) * P * BlockDiag(L2), achieving an 87.5% parameter reduction vs. dense mixing (8,192 vs. 65,536 params per head at T=256).

3. **LongConv** learns a full-length (T=256) causal filter per channel, enabling arbitrary position-dependent mixing.

4. **OrganelleGate** fuses all three via a per-channel softmax: each of the 256 embedding channels independently learns which organelle to rely on.

No positional encoding (RoPE) is needed: the Monarch matrices and LongConv kernels implicitly learn position-dependent patterns.
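
The factorization in step 2 can be checked numerically. A NumPy sketch (illustrative only; the model is Julia/Lux), assuming T=256 is split into 16 blocks of 16x16 and P is the standard reshape-transpose permutation used by Monarch matrices:

```python
import numpy as np

# NumPy sketch of one Monarch head. Assumption: T = 256 factors into
# 16 blocks of size 16, with P the reshape-transpose permutation.
T = 256
b = 16                                   # block size = sqrt(T)

rng = np.random.default_rng(0)
L1 = rng.standard_normal((b, b, b))      # 16 learned 16x16 blocks
L2 = rng.standard_normal((b, b, b))

def block_diag(blocks):
    n = len(blocks) * b
    M = np.zeros((n, n))
    for i, blk in enumerate(blocks):
        M[i * b:(i + 1) * b, i * b:(i + 1) * b] = blk
    return M

P = np.zeros((T, T))                     # permutation: transpose the 16x16 grid
for i in range(b):
    for j in range(b):
        P[i * b + j, j * b + i] = 1.0

# M = P^T * BlockDiag(L1) * P * BlockDiag(L2): global mixing from local blocks
M = P.T @ block_diag(L1) @ P @ block_diag(L2)

learned = 2 * b * b * b                  # only the block entries are learned
print(learned, T * T, 1 - learned / (T * T))   # 8192 65536 0.875
```

Only the 2 x 16 x 16 x 16 block entries are trainable; P is fixed, which is where the 87.5% reduction comes from.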

## Model Details

| Parameter | Value |
|---|---|
| Architecture | Symbiogenesis (3 organelles + gate) |
| Parameters | ~4.1M |
| Embed dim | 256 |
| Layers | 6 |
| Monarch heads | 4 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| FFN | SwiGLU (hidden=640) |
| Normalization | RMSNorm (pre-norm) |
| Weight tying | Yes (shared input/output embeddings) |
| Precision | Float32 (F16 slower for Monarch block sizes) |

### Parameter Breakdown

| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 12.6% |
| CausalConv (x6) | 6.1K | 0.2% |
| Monarch heads (x6, 4 heads each) | 197K | 4.8% |
| LongConv (x6) | 393K | 9.7% |
| OrganelleGate (x6) | 4.6K | 0.1% |
| SwiGLU FFN (x6) | 2.95M | 72.6% |
| RMSNorm (x13) | 3.3K | <0.1% |
| **Total** | **~4.1M** | |
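
The table's counts can be re-derived from the hyperparameters. A Python sketch, assuming bias-free layers, a three-matrix SwiGLU (gate/up/down), and 16x16 Monarch blocks (2 * 16^3 = 8,192 params per head):

```python
# Re-deriving the parameter table from the hyperparameters. Assumptions:
# bias-free layers, a three-matrix SwiGLU (gate/up/down), and 16x16
# Monarch blocks (2 * 16^3 = 8,192 params per head).
vocab, d, T, layers = 2000, 256, 256, 6
ffn_hidden, monarch_heads, kernel = 640, 4, 4

embed    = vocab * d                            # 512,000  tied embedding
conv     = d * kernel * layers                  # 6,144    depthwise causal conv
monarch  = 2 * 16**3 * monarch_heads * layers   # 196,608  block-diag factors
longconv = d * T * layers                       # 393,216  length-T filter/channel
gate     = 3 * d * layers                       # 4,608    3 logits per channel
swiglu   = 3 * d * ffn_hidden * layers          # 2,949,120 gate+up+down mats
rmsnorm  = d * 13                               # 3,328    2 per block + final

total = embed + conv + monarch + longconv + gate + swiglu + rmsnorm
print(total)   # 4065024, i.e. the "~4.1M" in the table
```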

### Sequence Mixing Efficiency

| | Transformer | Monarch | Symbiogenesis |
|---|---|---|---|
| Seq mixer params/block | 262K | 67K | 100K |
| Reduction vs Transformer | - | 74% | **62%** |
| Position encoding | RoPE (separate) | None | None |

## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (above the Chinchilla-optimal 20 tok/param) |
| Optimizer | AdamW (lr=1e-3, min_lr=1e-4, cosine decay) |
| Batch size | 32 |
| Hardware | NVIDIA RTX 3060 12GB |
| Throughput | ~19K tok/s (Float32) |
| Framework | Julia + Lux.jl + Zygote.jl + CUDA.jl |

### Training Progress (partial)

| Step | Train Loss | Val Loss | Val PPL | Gate Entropy |
|---|---|---|---|---|
| 1 | 17.10 | 17.03 | 24.9M | 1.099 |
| 500 | 6.50 | 4.92 | 137.5 | 1.098 |
| 1,000 | 4.43 | 4.38 | 79.9 | 1.094 |
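
The Val PPL column follows the standard definition PPL = exp(mean cross-entropy), so it can be sanity-checked against the Val Loss column (agreement is up to rounding of the reported losses):

```python
import math

# Sanity check: perplexity is exp(mean cross-entropy), so the Val PPL
# column should match exp(Val Loss) up to rounding of the reported losses.
for loss, reported in [(17.03, 24.9e6), (4.92, 137.5), (4.38, 79.9)]:
    print(f"exp({loss}) = {math.exp(loss):,.1f}  (table: {reported:,})")
```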

### Gelation Monitoring

Training includes phase-transition detection inspired by polymer physics:

- **CUSUM on loss curvature**: detects sudden changes in the 2nd derivative of the loss curve
- **Gate entropy**: tracks organelle specialization (ln 3 ≈ 1.099 = uniform, 0 = fully specialized)
- **Kuramoto order parameter**: measures synchronization of block dynamics (R > 0.9 = gelation)
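
Gate entropy here is the mean per-channel entropy of the 3-way organelle softmax, so its range is 0 to ln 3 ≈ 1.099. A NumPy sketch (function name hypothetical; the model's OrganelleGate is a Lux layer):

```python
import numpy as np

# The gate-entropy metric, sketched in NumPy. Each of the 256 channels
# holds 3 logits, one per organelle; the fusion weights are their softmax,
# and the reported entropy is the mean per-channel entropy of those weights.
def gate_entropy(logits):                        # logits: (channels, 3)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return float(-(w * np.log(w)).sum(axis=1).mean())

uniform = np.zeros((256, 3))                     # equal logits at init
special = np.tile([10.0, 0.0, 0.0], (256, 1))    # one dominant organelle

print(round(gate_entropy(uniform), 3))           # 1.099 (= ln 3)
print(round(gate_entropy(special), 3))           # 0.001 (nearly specialized)
```

The table's drift from 1.099 toward 1.094 over the first 1,000 steps shows the organelles just beginning to specialize.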

## Comparison with Other Julia SLM Variants

| | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | **SymbioSLM** |
|---|---|---|---|
| Architecture | Transformer | Monarch Mixer | Symbiogenesis |
| Sequence mixing | 4-head attention | 8-head Monarch + conv | 3 organelles + gate |
| Parameters | 5.04M | 4.98M | ~4.1M |
| Layers | 6 | 8 | 6 |
| Val PPL | **34.5** | 38.4 | TBD |
| Throughput | 26K tok/s | 19K tok/s | 19K tok/s |
| Position encoding | RoPE | None | None |

## Usage

### Generate with Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux, CUDA

tok = BPETokenizer("vocab.json", "merges.txt")
device = Lux.gpu_device()
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device)

model = create_model(ModelConfig(;
    arch="symbiogenesis", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    n_monarch_heads=4, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
println(text)
```

### OpenAI-Compatible API

The model is served via the [SymbioSLM Space](https://huggingface.co/spaces/LisaMegaWatts/SymbioSLM):

```bash
curl -X POST https://lisamegawatts-symbioslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

Streaming is supported with `"stream": true`.
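
For programmatic access, the curl call above translates to a few lines of standard-library Python (the response shape is assumed to follow the usual OpenAI `choices[0].message.content` layout, since the endpoint is OpenAI-compatible):

```python
import json
import urllib.request

# Minimal client for the Space's OpenAI-style endpoint shown above.
# URL and fields are taken from the curl example; "stream" is omitted.
URL = "https://lisamegawatts-symbioslm.hf.space/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment to call the live endpoint:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```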

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |

## Biological Inspiration

The architecture is named after Lynn Margulis's theory of **symbiogenesis** (1967): the proposal that eukaryotic cells originated through the endosymbiotic fusion of distinct prokaryotic organisms. Mitochondria and chloroplasts retain their own DNA, demonstrating their origin as once-independent organisms that became specialized organelles within a larger cell.

Similarly, each SymbioBlock contains three "organelles" with different mathematical properties (local convolution, global structured mixing, global dense filtering) that are fused into a single functional unit through the learned OrganelleGate. The gate entropy tracks how strongly the network differentiates between organelles, analogous to the degree of specialization achieved through evolutionary integration.

## Citation

```bibtex
@misc{symbioslm2026,
  title={Symbiogenesis: Multi-Organelle Sequence Mixing for Small Language Models},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
}
```

## References

- Margulis, L. (1967). On the origin of mitosing cells. *Journal of Theoretical Biology*, 14(3), 225-274.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
- Poli, M., et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. *ICML 2023*.
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. *arXiv:2312.00752*.

## License

MIT