LisaMegaWatts committed on
Commit 3e52795 · verified · 1 Parent(s): a3a67ef

Restore v1 model (val loss 3.62, full 12305-step training)

Files changed (1): README.md +114 -146

README.md CHANGED
@@ -16,7 +16,10 @@ tags:
  - swiglu
  - bpe
  - text-generation
 pipeline_tag: text-generation
 model-index:
 - name: SymbioSLM
   results:
@@ -28,196 +31,161 @@ model-index:
      name: philosophy-corpus
      metrics:
      - type: perplexity
-      value: 79.9
-      name: Val PPL (step 1000)
 ---
 
 # SymbioSLM
 
- A ~5M parameter decoder-only language model using the **Symbiogenesis** architecture — a novel multi-organelle sequence mixing design inspired by biological endosymbiosis (Margulis, 1967). Implemented entirely in Julia using Lux.jl and trained on classical philosophy texts.
 
 ## Architecture
 
- Symbiogenesis replaces softmax attention with three complementary "organelles" per block, fused via a learned per-channel gate:
 
- ```
- SymbioBlock (x6)
- +-- RMSNorm
- +-- SymbioSequenceMixer
- |   +-- Organelle 1: CausalDepthwiseConv1d (local n-gram patterns, K=4)
- |   +-- Organelle 2: Multi-head MonarchMatrix (global sub-quadratic mixing)
- |   +-- Organelle 3: LongConv (global dense causal filter)
- |   +-- OrganelleGate (per-channel softmax fusion)
- +-- RMSNorm
- +-- SwiGLU FFN
- ```
-
- ### How It Works
 
- 1. **CausalConv** captures local bigram/trigram/4-gram patterns via depthwise convolution (1 kernel per channel, length 4).
 
- 2. **Monarch matrices** provide global sequence mixing through the factored form M = P^T * BlockDiag(L1) * P * BlockDiag(L2), achieving an 87.5% parameter reduction vs dense mixing (8,192 vs 65,536 params per head at T=256).
 
- 3. **LongConv** learns a full-length (T=256) causal filter per channel, enabling arbitrary position-dependent mixing.
 
- 4. **OrganelleGate** fuses all three via a per-channel softmax: each of the 256 embedding channels independently learns which organelle to rely on.
 
- No positional encoding (RoPE) is needed — the Monarch matrices and LongConv kernels implicitly learn position-dependent patterns.
-
- ## Model Details
 
  | Parameter | Value |
- |---|---|
- | Architecture | Symbiogenesis (3 organelles + gate) |
- | Parameters | ~4.1M |
- | Embed dim | 256 |
- | Layers | 6 |
- | Monarch heads | 4 |
- | Context length | 256 tokens |
- | Vocabulary | 2,000 (ByteLevel BPE) |
- | FFN | SwiGLU (hidden=640) |
  | Normalization | RMSNorm (pre-norm) |
- | Weight tying | Yes (shared input/output embeddings) |
- | Precision | Float32 (F16 slower for Monarch block sizes) |
 
  ### Parameter Breakdown
 
  | Component | Params | % |
- |---|---|---|
- | Token embedding (tied) | 512K | 12.6% |
- | CausalConv (x6) | 6.1K | 0.2% |
- | Monarch heads (x6, 4 heads each) | 197K | 4.8% |
- | LongConv (x6) | 393K | 9.7% |
- | OrganelleGate (x6) | 4.6K | 0.1% |
- | SwiGLU FFN (x6) | 2.95M | 72.6% |
- | RMSNorm (x13) | 3.3K | <0.1% |
- | **Total** | **~4.1M** | |
-
- ### Sequence Mixing Efficiency
-
- | | Transformer | Monarch | Symbiogenesis |
- |---|---|---|---|
- | Seq mixer params/block | 262K | 67K | 100K |
- | Reduction vs Transformer | - | 74% | **62%** |
- | Position encoding | RoPE (separate) | None | None |
-
- ## Training
-
- | | Value |
- |---|---|
- | Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
- | Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
- | Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
- | Optimizer | AdamW (lr=1e-3, min_lr=1e-4, cosine decay) |
  | Batch size | 32 |
- | Hardware | NVIDIA RTX 3060 12GB |
- | Throughput | ~19K tok/s (Float32) |
- | Framework | Julia + Lux.jl + Zygote.jl + CUDA.jl |
-
- ### Training Progress (partial)
-
- | Step | Train Loss | Val Loss | Val PPL | Gate Entropy |
- |---|---|---|---|---|
- | 1 | 17.10 | 17.03 | 24.9M | 1.099 |
- | 500 | 6.50 | 4.92 | 137.5 | 1.098 |
- | 1,000 | 4.43 | 4.38 | 79.9 | 1.094 |
-
- ### Gelation Monitoring
-
- Training includes phase transition detection inspired by polymer physics:
-
- - **CUSUM on loss curvature**: Detects sudden changes in 2nd derivative of loss curve
- - **Gate entropy**: Tracks organelle specialization (1.099 = uniform, 0 = fully specialized)
- - **Kuramoto order parameter**: Measures synchronization of block dynamics (R > 0.9 = gelation)
- ## Comparison with Other Julia SLM Variants
-
- | | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | **SymbioSLM** |
- |---|---|---|---|
- | Architecture | Transformer | Monarch Mixer | Symbiogenesis |
- | Sequence mixing | 4-head attention | 8-head Monarch + conv | 3 organelles + gate |
- | Parameters | 5.04M | 4.98M | ~4.1M |
- | Layers | 6 | 8 | 6 |
- | Val PPL | **34.5** | 38.4 | TBD |
- | Throughput | 26K tok/s | 19K tok/s | 19K tok/s |
- | Position encoding | RoPE | None | None |
 
  ## Usage
 
- ### Generate with Julia
 
  ```julia
- using Pkg; Pkg.activate("julia-slm")
- include("src/JuliaGPT.jl")
- using .JuliaGPT
- using .JuliaGPT: Lux, CUDA
-
- tok = BPETokenizer("vocab.json", "merges.txt")
- device = Lux.gpu_device()
- ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device)
-
- model = create_model(ModelConfig(;
-     arch="symbiogenesis", vocab_size=vocab_size(tok),
-     embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
-     n_monarch_heads=4, conv_kernel_size=4,
-     ffn_mult=4, context_length=256, weight_tying=true,
- ))
-
- text = generate(model, ps, st, tok, "the nature of ";
-     max_new_tokens=200, temperature=0.8, top_k=40)
- println(text)
- ```
 
- ### OpenAI-Compatible API
 
- The model is served via [SymbioSLM Space](https://huggingface.co/spaces/LisaMegaWatts/SymbioSLM):
 
- ```bash
- curl -X POST https://lisamegawatts-symbioslm.hf.space/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{
-     "messages": [{"role": "user", "content": "the nature of"}],
-     "max_tokens": 200,
-     "temperature": 0.8,
-     "top_k": 40
-   }'
  ```
 
- Streaming supported with `"stream": true`.
-
- ## Files
-
- | File | Description |
- |---|---|
- | `final.jld2` | Trained model parameters (JLD2 format) |
- | `config.toml` | Model architecture configuration |
- | `vocab.json` | BPE vocabulary (2000 tokens) |
- | `merges.txt` | BPE merge rules |
-
- ## Biological Inspiration
-
- The architecture is named after Lynn Margulis' theory of **symbiogenesis** (1967): the proposal that eukaryotic cells originated through the endosymbiotic fusion of distinct prokaryotic organisms. Mitochondria and chloroplasts retain their own DNA, demonstrating their origin as once-independent organisms that became specialized organelles within a larger cell.
 
- Similarly, each SymbioBlock contains three "organelles" with different mathematical properties (local convolution, global structured mixing, global dense filtering) that are fused into a single functional unit through the learned OrganelleGate. The gate entropy tracks how strongly the network differentiates between organelles, analogous to the degree of specialization achieved through evolutionary integration.
 
203
  ## Citation
204
 
205
  ```bibtex
206
- @misc{symbioslm2026,
207
- title={Symbiogenesis: Multi-Organelle Sequence Mixing for Small Language Models},
208
  author={LisaMegaWatts},
209
  year={2026},
210
  url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
211
  }
212
  ```
213
 
214
- ## References
215
-
216
- - Margulis, L. (1967). On the origin of mitosing cells. *J. Theoretical Biology*, 14(3), 225-274.
217
- - Dao, T., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
218
- - Poli, M., et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. *ICML 2023*.
219
- - Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
220
-
221
  ## License
222
 
223
  MIT
 
  - swiglu
  - bpe
  - text-generation
+ - attention-free
 pipeline_tag: text-generation
+ datasets:
+ - LisaMegaWatts/philosophy-corpus
 model-index:
 - name: SymbioSLM
   results:
      name: philosophy-corpus
      metrics:
      - type: perplexity
+      value: 37.3
+      name: Val PPL
+      verified: false
+      - type: loss
+      value: 3.62
+      name: Val Loss
+      verified: false
 ---
 
 # SymbioSLM
 
+ A **5.05M parameter** attention-free language model using the **Symbiogenesis** architecture — multi-organelle sequence mixing with learned per-channel gating. Trained on a philosophy corpus of 981 classical texts (~795M tokens).
 
 ## Architecture
 
+ Symbiogenesis replaces self-attention with three complementary "organelles" for sequence mixing, inspired by the biological theory of symbiogenesis (Margulis, 1967) — where complex organelles like mitochondria were once independent organisms that fused into eukaryotic cells.
 
+ Each of the 8 SymbioBlocks contains:
 
+ | Organelle | Function | Scale | Complexity |
+ |-----------|----------|-------|------------|
+ | **CausalDepthwiseConv1d** | Local n-gram pattern detection | Local (kernel=4) | O(n) |
+ | **Monarch Matrix** | Sub-quadratic global sequence mixing | Global | O(n√n) |
+ | **LongConv** | Dense causal convolution filtering | Global | O(n log n) |
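The sub-quadratic cost of the Monarch organelle follows from its factored form M = P^T * BlockDiag(L1) * P * BlockDiag(L2): the permutations are fixed, so only the two block-diagonal factors carry parameters. A quick arithmetic sketch, assuming √T-sized blocks (this reproduces the 8,192-vs-65,536 per-head count at T=256 quoted in the previous version of this card):

```python
import math

def monarch_param_count(T: int) -> int:
    """Parameters in a Monarch-factored T x T mixing matrix:
    two block-diagonal factors, each holding sqrt(T) dense blocks
    of size sqrt(T) x sqrt(T); the permutations P are parameter-free."""
    b = math.isqrt(T)
    assert b * b == T, "T must be a perfect square"
    per_factor = b * b * b           # sqrt(T) blocks, sqrt(T)^2 entries each
    return 2 * per_factor            # BlockDiag(L1) and BlockDiag(L2)

T = 256
dense = T * T                        # dense mixing matrix over T positions
monarch = monarch_param_count(T)
print(dense, monarch, 1 - monarch / dense)  # 65536 8192 0.875
```

The 2·T^1.5 parameter count is where the O(n√n) entry in the table comes from; at T=256 it is an 87.5% reduction over a dense mixer.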
 
+ An **OrganelleGate** (per-channel softmax) learns which organelle each embedding channel relies on, creating specialized "fused organisms" per block.
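The fusion step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the Lux.jl implementation: the name `organelle_gate` and the shapes are assumptions, with the gate taken to hold one learned logit per organelle per channel. With all-zero logits the gate is uniform over the 3 organelles, i.e. entropy ln 3 ≈ 1.099, the reported starting value of the gate-entropy metric.

```python
import numpy as np

def organelle_gate(outs, logits):
    """Fuse organelle outputs with a per-channel softmax gate.
    outs:   list of 3 arrays, each (T, D) — one output per organelle.
    logits: (3, D) learned gate logits — one weight per organelle per channel."""
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)            # softmax over the 3 organelles
    return sum(w[i] * outs[i] for i in range(len(outs)))  # (D,) weights broadcast over T

T, D = 4, 8                           # toy sizes (the model uses T=256, D=256)
rng = np.random.default_rng(0)
outs = [rng.normal(size=(T, D)) for _ in range(3)]
logits = np.zeros((3, D))             # uniform gate: each organelle weighted 1/3
y = organelle_gate(outs, logits)
print(y.shape)                        # (4, 8)
```

Gating per channel rather than per token keeps the gate tiny — 3 × 256 = 768 logits per block, roughly consistent with the ~769-parameter OrganelleGate in the breakdown below.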
 
+ ### No Positional Encoding
 
+ SymbioSLM requires **no explicit positional encoding** (no RoPE, no sinusoidal embeddings). The Monarch matrices and LongConv kernels implicitly learn position-dependent mixing patterns, while CausalConv captures local ordering through its convolutional structure.
 
+ ### Model Specifications
 
  | Parameter | Value |
+ |-----------|-------|
+ | Architecture | Symbiogenesis |
+ | Parameters | 5,052,672 (5.05M) |
+ | Embedding dim | 256 |
+ | Layers | 8 |
+ | Monarch heads | 1 per block |
+ | Conv kernel | 4 |
+ | FFN | SwiGLU (4x, 2/3 adjusted) |
  | Normalization | RMSNorm (pre-norm) |
+ | Context length | 256 tokens |
+ | Vocab size | 2,000 (BPE) |
+ | Weight tying | Yes |
+ | Free energy reg | 0.001 |
 
  ### Parameter Breakdown
 
  | Component | Params | % |
+ |-----------|--------|---|
+ | Token embedding | 512,000 | 10.1% |
+ | SymbioBlocks (8x) | 4,540,672 | 89.9% |
+ | &nbsp;&nbsp; CausalConv | ~8K/block | |
+ | &nbsp;&nbsp; Monarch | ~131K/block | |
+ | &nbsp;&nbsp; LongConv | ~65K/block | |
+ | &nbsp;&nbsp; OrganelleGate | ~769/block | |
+ | &nbsp;&nbsp; SwiGLU FFN | ~350K/block | |
+ | &nbsp;&nbsp; RMSNorm (2x) | ~512/block | |
+ | Final RMSNorm | 256 | <0.1% |
+
+ ## Results
+
+ Trained for 12,305 steps on an NVIDIA RTX 3060 (12GB).
+
+ | Metric | Value |
+ |--------|-------|
+ | **Val Loss** | **3.62** |
+ | **Val PPL** | **37.3** |
+ | Training steps | 12,305 |
  | Batch size | 32 |
+ | Precision | Float16 (AMP) |
+
+ ### Comparison with Other 5M Julia SLMs
+
+ All models trained on the same philosophy corpus with identical tokenizer and training budget (12,305 steps):
+
+ | Model | Architecture | Params | Val Loss | Val PPL |
+ |-------|-------------|--------|----------|---------|
+ | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer (RoPE) | 5.04M | **3.54** | **34.5** |
+ | **SymbioSLM** | **Symbiogenesis** | **5.05M** | **3.62** | **37.3** |
+ | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 5.04M | 3.65 | 38.4 |
+
+ SymbioSLM outperforms the Monarch-only baseline while using no attention mechanism. The multi-organelle fusion provides complementary mixing at different scales that a single mixer cannot achieve alone.
+
+ ## Training Configuration
+
+ ```toml
+ [model]
+ arch = "symbiogenesis"
+ embed_dim = 256
+ n_layers = 8
+ n_monarch_heads = 1
+ conv_kernel_size = 4
+ ffn_mult = 4
+ context_length = 256
+ weight_tying = true
+ free_energy_beta = 0.001
+
+ [training]
+ optimizer = "adamw"
+ lr = 6e-4
+ min_lr = 6e-5
+ warmup_steps = 500
+ max_steps = 12305
+ batch_size = 32
+ grad_clip = 1.0
+ precision = "f16"
+ ```
 
+ ## Gelation Monitoring
 
+ Training includes gelation monitoring via CUSUM change-point detection on gate entropy. This tracks when the organelle gates transition from uniform mixing to specialized configurations — a phase transition analogous to gel formation in polymer physics.
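The detector can be sketched generically: a one-sided CUSUM accumulates drops of gate entropy below its initial (uniform, ln 3 ≈ 1.099) value and flags a change once the cumulative drop clears a threshold. The `drift` and `threshold` values below are illustrative, not the ones used in training:

```python
def cusum_change_step(series, drift=0.0, threshold=0.05):
    """One-sided CUSUM: flag the first index where the cumulative
    downward deviation from the initial value exceeds `threshold`.
    Returns the index, or None if no change is detected."""
    s, baseline = 0.0, series[0]
    for i, x in enumerate(series):
        s = max(0.0, s + (baseline - x) - drift)  # accumulate drops below baseline
        if s > threshold:
            return i
    return None

# Toy gate-entropy trace: flat near ln(3), then specialization begins.
entropy = [1.099, 1.099, 1.098, 1.094, 1.080, 1.050, 0.990]
print(cusum_change_step(entropy))  # prints 5
```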
 
  ## Usage
 
+ ### Julia (Lux.jl)
 
  ```julia
+ using JuliaGPT
+
+ # Load model
+ config = load_config("config.toml")
+ model = create_model(config.model)
+ ps, st, _, _, _ = load_checkpoint("final.jld2")
+
+ # Load tokenizer
+ tokenizer = BPETokenizer("vocab.json", "merges.txt")
+
+ # Generate text
+ prompt = "The nature of reality"
+ output = generate(model, ps, st, tokenizer, prompt;
+     max_new_tokens=200, temperature=0.8, top_k=40)
+ println(output)
  ```
 
+ ## References
 
+ - **Symbiogenesis framework**: [DavinciDreams/symbiogenesis](https://github.com/DavinciDreams/symbiogenesis) — Evolutionary NAS via organism fusion
+ - **Monarch Mixer**: Dao et al., 2023 — Sub-quadratic GEMM-based sequence mixing
+ - **Hyena**: Poli et al., 2023 — Long convolutions for sequence modeling
+ - **Endosymbiotic theory**: Margulis, 1967 — Origin of eukaryotic organelles
 
  ## Citation
 
  ```bibtex
+ @misc{symbio-slm-2026,
+   title={SymbioSLM: Multi-Organelle Sequence Mixing for Attention-Free Language Modeling},
    author={LisaMegaWatts},
    year={2026},
    url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
  }
  ```
 
  ## License
 
  MIT