---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- symbiogenesis
- monarch-mixer
- long-convolution
- causal-conv
- rmsnorm
- swiglu
- bpe
- text-generation
- attention-free
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: SymbioSLM
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: LisaMegaWatts/philosophy-corpus
name: philosophy-corpus
metrics:
- type: perplexity
value: 37.3
name: Val PPL
verified: false
- type: loss
value: 3.62
name: Val Loss
verified: false
---
# SymbioSLM
A **5.05M parameter** attention-free language model using the **Symbiogenesis** architecture — multi-organelle sequence mixing with learned per-channel gating. Trained on a philosophy corpus of 981 classical texts (~795M tokens).
## Architecture
Symbiogenesis replaces self-attention with three complementary "organelles" for sequence mixing, inspired by the biological theory of symbiogenesis (Margulis, 1967), which holds that complex organelles such as mitochondria began as independent organisms that merged into eukaryotic cells.
Each of the 8 SymbioBlocks contains:
| Organelle | Function | Scale | Complexity |
|-----------|----------|-------|------------|
| **CausalDepthwiseConv1d** | Local n-gram pattern detection | Local (kernel=4) | O(n) |
| **Monarch Matrix** | Sub-quadratic global sequence mixing | Global | O(n√n) |
| **LongConv** | Dense causal convolution filtering | Global | O(n log n) |
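The O(n√n) cost of Monarch-style mixing comes from reshaping the length-n sequence into a √n × √n grid and mixing along each axis with a small dense matrix. The sketch below is a minimal NumPy illustration of that complexity argument only — it ignores the causal masking the actual model needs, and is not the Lux implementation:

```python
import numpy as np

def monarch_mix(x, w_rows, w_cols):
    """Toy Monarch-style mixing: O(n * sqrt(n) * d) instead of O(n^2 * d).

    x: (n, d) sequence, with n a perfect square.
    w_rows, w_cols: (m, m) mixing matrices, m = sqrt(n).
    """
    n, d = x.shape
    m = int(np.sqrt(n))
    assert m * m == n, "sequence length must be a perfect square"
    g = x.reshape(m, m, d)                   # view the sequence as an m x m grid
    g = np.einsum("ij,jkd->ikd", w_rows, g)  # mix along the first grid axis
    g = np.einsum("jk,ijd->ikd", w_cols, g)  # mix along the second grid axis
    return g.reshape(n, d)

rng = np.random.default_rng(0)
n, d = 256, 256  # matches the model's context length and embedding dim
y = monarch_mix(rng.standard_normal((n, d)),
                rng.standard_normal((16, 16)),
                rng.standard_normal((16, 16)))
print(y.shape)  # (256, 256)
```

After the two axis mixes, every output position depends on every input position, at a cost of roughly 2·n·√n·d multiply-adds rather than n²·d.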
An **OrganelleGate** (per-channel softmax) learns which organelle each embedding channel relies on, creating specialized "fused organisms" per block.
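Per-channel softmax gating can be sketched as follows (shapes and names are illustrative; the real `OrganelleGate` lives in the Julia/Lux code):

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def organelle_gate(y_conv, y_monarch, y_longconv, gate_logits):
    """Blend three organelle outputs with per-channel softmax weights.

    y_*: (n, d) outputs of the three mixers.
    gate_logits: (3, d) learned logits, one weight triple per channel.
    """
    w = softmax(gate_logits, axis=0)                # (3, d), columns sum to 1
    ys = np.stack([y_conv, y_monarch, y_longconv])  # (3, n, d)
    return np.einsum("kd,knd->nd", w, ys)           # per-channel convex blend

rng = np.random.default_rng(0)
n, d = 256, 256
outs = [rng.standard_normal((n, d)) for _ in range(3)]
logits = rng.standard_normal((3, d))
y = organelle_gate(*outs, logits)
print(y.shape)  # (256, 256)
```

Note that 3 × 256 = 768 logits, consistent with the ~769 gate parameters per block listed below (presumably one extra scalar, e.g. a temperature or bias).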
### No Positional Encoding
SymbioSLM requires **no explicit positional encoding** (no RoPE, no sinusoidal embeddings). The Monarch matrices and LongConv kernels implicitly learn position-dependent mixing patterns, while CausalConv captures local ordering through its convolutional structure.
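The local-ordering claim is easy to verify in a sketch: with left padding of kernel_size − 1, output position t sees only inputs t−3…t (kernel = 4), so no future token can influence it. A NumPy illustration, not the model's Lux layer:

```python
import numpy as np

def causal_depthwise_conv(x, w):
    """Depthwise causal conv: out[t, c] = sum_i w[i, c] * x[t - (k-1) + i, c].

    x: (n, d) sequence, w: (k, d) per-channel kernel (k = 4 in SymbioSLM).
    """
    k = w.shape[0]
    xp = np.pad(x, ((k - 1, 0), (0, 0)))  # left-pad so no future positions leak in
    return np.stack([(xp[t:t + k] * w).sum(axis=0) for t in range(x.shape[0])])

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4))
w = rng.standard_normal((4, 4))
y1 = causal_depthwise_conv(x, w)

x2 = x.copy()
x2[10] += 1.0                         # perturb a "future" token...
y2 = causal_depthwise_conv(x2, w)
print(np.allclose(y1[:10], y2[:10]))  # True: positions before t=10 are unchanged
```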
### Model Specifications
| Parameter | Value |
|-----------|-------|
| Architecture | Symbiogenesis |
| Parameters | 5,052,672 (5.05M) |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 1 per block |
| Conv kernel | 4 |
| FFN | SwiGLU (4x, 2/3 adjusted) |
| Normalization | RMSNorm (pre-norm) |
| Context length | 256 tokens |
| Vocab size | 2,000 (BPE) |
| Weight tying | Yes |
| Free energy reg | 0.001 |
### Parameter Breakdown
| Component | Params | % |
|-----------|--------|---|
| Token embedding | 512,000 | 10.1% |
| SymbioBlocks (8x) | 4,540,672 | 89.9% |
|    CausalConv | ~8K/block | |
|    Monarch | ~131K/block | |
|    LongConv | ~65K/block | |
|    OrganelleGate | ~769/block | |
|    SwiGLU FFN | ~350K/block | |
|    RMSNorm (2x) | ~512/block | |
| Final RMSNorm | 256 | <0.1% |
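The headline shares in the table can be re-derived from the specs (a quick sanity check of the embedding and block figures only; the approximate per-organelle counts are taken as given):

```python
# Sanity-check the parameter shares from the spec tables above.
vocab, d_model, total = 2000, 256, 5_052_672

embed = vocab * d_model   # weight tying: the output head reuses this matrix
blocks = 4_540_672        # SymbioBlocks total, as listed above

print(embed)                           # 512000
print(round(embed / total * 100, 1))   # 10.1 (%)
print(round(blocks / total * 100, 1))  # 89.9 (%)
```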
## Results
Trained for 12,305 steps on an NVIDIA RTX 3060 (12GB).
| Metric | Value |
|--------|-------|
| **Val Loss** | **3.62** |
| **Val PPL** | **37.3** |
| Training steps | 12,305 |
| Batch size | 32 |
| Precision | Float16 (AMP) |
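Validation perplexity is simply the exponential of the validation cross-entropy loss (in nats), so the two headline numbers are consistent:

```python
import math

def ppl(val_loss):
    """Perplexity from cross-entropy loss in nats."""
    return math.exp(val_loss)

print(round(ppl(3.62), 1))  # 37.3 -> SymbioSLM's reported Val PPL
```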
### Comparison with Other 5M Julia SLMs
All models trained on the same philosophy corpus with identical tokenizer and training budget (12,305 steps):
| Model | Architecture | Params | Val Loss | Val PPL |
|-------|-------------|--------|----------|---------|
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer (RoPE) | 5.04M | **3.54** | **34.5** |
| **SymbioSLM** | **Symbiogenesis** | **5.05M** | **3.62** | **37.3** |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 5.04M | 3.65 | 38.4 |
SymbioSLM outperforms the Monarch-only baseline without using any attention mechanism, though it still trails the RoPE transformer by 0.08 val loss. The multi-organelle fusion provides complementary mixing at scales that no single mixer covers on its own.
## Training Configuration
```toml
[model]
arch = "symbiogenesis"
embed_dim = 256
n_layers = 8
n_monarch_heads = 1
conv_kernel_size = 4
ffn_mult = 4
context_length = 256
weight_tying = true
free_energy_beta = 0.001
[training]
optimizer = "adamw"
lr = 6e-4
min_lr = 6e-5
warmup_steps = 500
max_steps = 12305
batch_size = 32
grad_clip = 1.0
precision = "f16"
```
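The config names `lr`, `min_lr`, `warmup_steps`, and `max_steps` but not the decay shape. Assuming the common linear-warmup plus cosine-decay pattern (an assumption, not confirmed by the config), the schedule would look like:

```python
import math

# From the config above; the cosine decay itself is an assumption.
LR, MIN_LR, WARMUP, MAX_STEPS = 6e-4, 6e-5, 500, 12305

def lr_at(step):
    """Hypothetical linear-warmup + cosine-decay learning-rate schedule."""
    if step < WARMUP:
        return LR * step / WARMUP  # linear ramp from 0 to peak lr
    progress = (step - WARMUP) / (MAX_STEPS - WARMUP)
    return MIN_LR + 0.5 * (LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(lr_at(500))    # peak 6e-4 at the end of warmup
print(lr_at(12305))  # decays to min_lr 6e-5 at the final step
```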
## Gelation Monitoring
Training includes gelation monitoring via CUSUM change-point detection on gate entropy. This tracks when the organelle gates transition from uniform mixing to specialized configurations — a phase transition analogous to gel formation in polymer physics.
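A one-sided CUSUM on gate entropy can be sketched as below. Early in training the gates are near-uniform (entropy ≈ ln 3 ≈ 1.10 over three organelles), and the detector accumulates evidence whenever entropy drops below that reference by more than a drift allowance. The constants and the synthetic trace are illustrative, not the training script's actual settings:

```python
import math

def cusum_drop_detector(series, mu0, drift=0.1, threshold=3.0):
    """One-sided CUSUM: return the first step where the cumulative downward
    deviation of `series` from the reference mean `mu0` exceeds `threshold`."""
    s = 0.0
    for t, x in enumerate(series):
        s = max(0.0, s + (mu0 - x - drift))  # accumulates only when x << mu0
        if s > threshold:
            return t
    return None

# Synthetic gate-entropy trace: uniform gates (~ln 3), then a specialization drop.
entropy = [1.08] * 50 + [0.30] * 50
change_point = cusum_drop_detector(entropy, mu0=math.log(3))
print(change_point)  # fires a few steps after the transition at t = 50
```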
## Usage
### Julia (Lux.jl)
```julia
using JuliaGPT
# Load model
config = load_config("config.toml")
model = create_model(config.model)
ps, st, _, _, _ = load_checkpoint("final.jld2")
# Load tokenizer
tokenizer = BPETokenizer("vocab.json", "merges.txt")
# Generate text
prompt = "The nature of reality"
output = generate(model, ps, st, tokenizer, prompt;
                  max_new_tokens=200, temperature=0.8, top_k=40)
println(output)
```
## References
- **Symbiogenesis framework**: [DavinciDreams/symbiogenesis](https://github.com/DavinciDreams/symbiogenesis) — Evolutionary NAS via organism fusion
- **Monarch Mixer**: Dao et al., 2023 — Sub-quadratic GEMM-based sequence mixing
- **Hyena**: Poli et al., 2023 — Long convolutions for sequence modeling
- **Endosymbiotic theory**: Margulis, 1967 — Origin of eukaryotic organelles
## Citation
```bibtex
@misc{symbio-slm-2026,
  title={SymbioSLM: Multi-Organelle Sequence Mixing for Attention-Free Language Modeling},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
}
```
## License
MIT