docs: improved model card with training narrative and polaris-revival link

README.md CHANGED

@@ -14,7 +14,8 @@ base_model: RtaForge/Anvaya-Rabbit-2.7B

# Anvaya-Rabbit 2.7B – v0.1 Alpha

> **The architecture, training protocol, and infrastructure are the story.**

Rabbit is the first model in the Anvaya series – a proof of concept demonstrating
that a fully custom State-Space Model (SSM) can be trained from scratch, on a
single consumer-grade GPU, with no dependence on attention or transformer

@@ -61,14 +62,25 @@ tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

> guha@rtaforge.in for access). This model uses a custom SSM architecture
> not compatible with standard HuggingFace `AutoModel`.

**Training infrastructure**: [`Rta-Forge/polaris-revival`](https://github.com/Rta-Forge/polaris-revival) – a
patched ROCm 7.2 runtime restoring native HIP dispatch on gfx803 (RX 560X), with
fused SSM recurrence kernels. MIT licensed.
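
Rabbit's model code is private, so for orientation only, here is the plain
(unfused) linear state-space recurrence that kernels like these collapse into a
single GPU pass; the shapes and parameterization below are illustrative
assumptions, not Rabbit's actual architecture.

```python
import torch

# Unfused reference SSM recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
# A fused kernel executes this whole scan in one launch instead of a
# Python-level loop over timesteps. Dimensions are illustrative only.

def ssm_scan(x, A, B, C):
    """x: (batch, seq_len, d_in) -> y: (batch, seq_len, d_out)."""
    batch, seq_len, _ = x.shape
    h = x.new_zeros(batch, A.shape[0])
    ys = []
    for t in range(seq_len):
        h = h @ A.T + x[:, t] @ B.T    # state update
        ys.append(h @ C.T)             # readout
    return torch.stack(ys, dim=1)

d_in, d_state, d_out = 16, 32, 16
A = 0.1 * torch.randn(d_state, d_state)   # kept small for stability
B = torch.randn(d_state, d_in)
C = torch.randn(d_out, d_state)
y = ssm_scan(torch.randn(3, 64, d_in), A, B, C)   # seq_len=64, as trained
print(y.shape)   # torch.Size([3, 64, 16])
```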

## Training

Two proprietary components make this training regime possible:

- **Subsuminator** – migrates learned weights across architectures without
  retraining from scratch, enabling efficient curriculum transfer.
- **Gurukul** – a constitutional Sisya/Guru proposal-validation loop: Sisya
  proposes weight deltas, and Guru validates them against constitutional
  constraints before they are applied, extracting strong learning signals
  from limited data and compute (both ideas are sketched below).

Together they are why Rabbit trained in 7 days on a single consumer GPU.
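
Neither tool is published, so what follows are loose, hypothetical sketches of
the two ideas; every name, shape, and rule in them is invented for
illustration. First, Subsuminator-style weight migration: copy the overlapping
slice of each parameter whose name matches from a source checkpoint into a
larger target, so the next curriculum phase resumes from learned weights
instead of a cold start.

```python
import torch
import torch.nn as nn

# Hypothetical Subsuminator-style migration: copy the overlapping slice of
# every name-matched parameter from a source model into a larger target.

def subsume(src: nn.Module, dst: nn.Module) -> None:
    dst_params = dict(dst.named_parameters())
    with torch.no_grad():
        for name, p_src in src.named_parameters():
            p_dst = dst_params.get(name)
            if p_dst is None:
                continue           # no counterpart in the new architecture
            idx = tuple(slice(0, min(a, b))
                        for a, b in zip(p_src.shape, p_dst.shape))
            p_dst[idx].copy_(p_src[idx])

subsume(nn.Linear(16, 16), nn.Linear(32, 32))   # toy source -> toy target
```

A Gurukul-style loop can then be sketched as: Sisya proposes a delta, and Guru
accepts it only if it passes constitutional checks. The two checks here (a
bounded update norm, no regression on a held-out batch) are invented stand-ins
for the real constitution.

```python
import torch
import torch.nn as nn

# Hypothetical Gurukul-style proposal-validation loop; all names and
# constraints are illustrative, not the proprietary API.

torch.manual_seed(0)
model = nn.Linear(8, 1)            # toy stand-in for the 2.7B SSM
loss_fn = nn.MSELoss()

def make_batch():
    x = torch.randn(3, 8)          # batch_size=3, as in the card
    return x, x.sum(dim=1, keepdim=True)

def sisya_propose(lr=1e-5):
    """Sisya: propose a weight delta from one gradient step on fresh data."""
    x, y = make_batch()
    grads = torch.autograd.grad(loss_fn(model(x), y), list(model.parameters()))
    return [-lr * g for g in grads]

def apply_delta(delta, sign=1.0):
    with torch.no_grad():
        for p, d in zip(model.parameters(), delta):
            p.add_(sign * d)

def guru_validate(delta, max_norm=1.0):
    """Guru: accept only bounded deltas that don't worsen held-out loss."""
    if torch.sqrt(sum((d ** 2).sum() for d in delta)) > max_norm:
        return False
    x, y = make_batch()            # stands in for a held-out batch
    with torch.no_grad():
        before = loss_fn(model(x), y).item()
        apply_delta(delta)         # trial application...
        after = loss_fn(model(x), y).item()
        apply_delta(delta, -1.0)   # ...rolled back either way
    return after <= before

accepted = 0
for _ in range(200):               # the card reports 1,500 accepted proposals
    delta = sisya_propose()
    if guru_validate(delta):
        apply_delta(delta)
        accepted += 1
print("accepted proposals:", accepted)
```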

**1,500 accepted Gurukul proposals across 6 phases on a single AceCloud L4 (24GB VRAM).
~7 days of effective training time (total elapsed time was higher due to crash
recovery and VRAM-leak debugging).**

| Phase | Proposals | Dataset | Focus |
|-------|-----------|---------|-------|

@@ -81,6 +93,9 @@ SFT imprint applied using surface-only gate-layer fine-tuning (65 examples, 3 ep

**Final checkpoint: Step 1,500.** seq_len=64, batch_size=3, optimizer=Lion, lr=1e-5.
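
For concreteness, here is how that configuration would look with the
open-source `lion-pytorch` implementation of the Lion optimizer; this is an
assumption for illustration, since Rabbit's actual training stack and model
class are unreleased.

```python
import torch.nn as nn
from lion_pytorch import Lion   # pip install lion-pytorch

# Final-checkpoint hyperparameters from the card; nn.Linear is a
# placeholder, since the real 2.7B SSM implementation is private.
SEQ_LEN, BATCH_SIZE = 64, 3
model = nn.Linear(16, 16)
optimizer = Lion(model.parameters(), lr=1e-5)
```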

SFT imprint applied using surface-only gate-layer fine-tuning (65 examples, 3 epochs),
trained with the Anvaya Gurukul protocol.

## Evaluation Results (Step 1,500)

### Internal – Scale-Invariant Metrics

@@ -103,18 +118,19 @@ capability.

### Commercial Benchmarks (lm-eval harness)

> **Standard academic benchmarks are not yet meaningful here.** Rabbit was
> deliberately trained at seq_len=64 as a pure architecture proof. Standard
> lm-eval prompts (few-shot examples + question) run 150–400 tokens, well
> beyond Rabbit's training context. Raccoon (seq_len=512) removes this
> constraint entirely.
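
The context-length mismatch is easy to check with the GPT-NeoX tokenizer the
card pairs with the model; the 5-shot stub below is invented for illustration,
not an actual lm-eval prompt.

```python
from transformers import AutoTokenizer

# Count the tokens of a typical few-shot prompt against Rabbit's
# training context of 64 tokens.
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

shot = ("Question: Which gas do plants absorb from the air?\n"
        "A. oxygen  B. carbon dioxide  C. nitrogen  D. helium\n"
        "Answer: B\n\n")
prompt = shot * 5 + "Question:"

n = len(tok(prompt).input_ids)
print(n, "prompt tokens vs seq_len=64 -> overflow:", n > 64)
```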

| Benchmark | Score | Notes |
|-----------|-------|-------|
| HellaSwag | 25.89% | Prompt exceeds training seq_len |
| ARC-Challenge | 26.71% | Prompt exceeds training seq_len |
| MMLU | 26.89% | Prompt exceeds training seq_len |
| WinoGrande | 48.62% | Prompt exceeds training seq_len |
| TruthfulQA MC1 | 21.91% | Prompt exceeds training seq_len |

## What Comes Next

@@ -125,5 +141,5 @@ capability.

| **Polar Bear** | ~13B | 512 | Planned – STEM + AEVA anti-hallucination layer |

The delta between Rabbit and Raccoon is the story. One epoch → two epochs,
seq_len 64 → 512, 2.7B → 6.1B. Same pipeline, same hardware philosophy.
**Give us more resources and watch what happens.**