docs: improved model card with training narrative and polaris-revival link

README.md CHANGED

@@ -14,7 +14,8 @@ base_model: RtaForge/Anvaya-Rabbit-2.7B

# Anvaya-Rabbit 2.7B – v0.1 Alpha

> **The architecture, training protocol, and infrastructure are the story.**

Rabbit is the first model in the Anvaya series – a proof of concept demonstrating
that a fully custom State-Space Model (SSM) can be trained from scratch, on a
single consumer-grade GPU, with no dependence on attention or transformer

@@ -61,14 +62,25 @@ tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

> guha@rtaforge.in for access). This model uses a custom SSM architecture
> not compatible with standard HuggingFace `AutoModel`.

**Training infrastructure**: [`Rta-Forge/polaris-revival`](https://github.com/Rta-Forge/polaris-revival) – a
patched ROCm 7.2 runtime restoring native HIP dispatch on gfx803 (RX 560X), with
fused SSM recurrence kernels. MIT licensed.
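
Rabbit's model code is private, so for orientation only, here is the plain
(unfused) linear state-space recurrence that kernels like these collapse into a
single GPU pass; the shapes and parameterization below are illustrative
assumptions, not Rabbit's actual architecture.

```python
import torch

# Unfused reference SSM recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
# A fused kernel executes this whole scan in one launch instead of a
# Python-level loop over timesteps. Dimensions are illustrative only.

def ssm_scan(x, A, B, C):
    """x: (batch, seq_len, d_in) -> y: (batch, seq_len, d_out)."""
    batch, seq_len, _ = x.shape
    h = x.new_zeros(batch, A.shape[0])
    ys = []
    for t in range(seq_len):
        h = h @ A.T + x[:, t] @ B.T    # state update
        ys.append(h @ C.T)             # readout
    return torch.stack(ys, dim=1)

d_in, d_state, d_out = 16, 32, 16
A = 0.1 * torch.randn(d_state, d_state)   # kept small for stability
B = torch.randn(d_state, d_in)
C = torch.randn(d_out, d_state)
y = ssm_scan(torch.randn(3, 64, d_in), A, B, C)   # seq_len=64, as trained
print(y.shape)   # torch.Size([3, 64, 16])
```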

## Training

Two proprietary components make this training regime possible:

- **Subsuminator** – migrates learned weights across architectures without
  retraining from scratch, enabling efficient curriculum transfer.
- **Gurukul** – a constitutional Sisya/Guru proposal-validation loop: Sisya
  proposes weight deltas, and Guru validates them against constitutional
  constraints before they are applied, extracting strong learning signals
  from limited data and compute (both ideas are sketched below).

Together they are why Rabbit trained in 7 days on a single consumer GPU.
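
Neither tool is published, so what follows are loose, hypothetical sketches of
the two ideas; every name, shape, and rule in them is invented for
illustration. First, Subsuminator-style weight migration: copy the overlapping
slice of each parameter whose name matches from a source checkpoint into a
larger target, so the next curriculum phase resumes from learned weights
instead of a cold start.

```python
import torch
import torch.nn as nn

# Hypothetical Subsuminator-style migration: copy the overlapping slice of
# every name-matched parameter from a source model into a larger target.

def subsume(src: nn.Module, dst: nn.Module) -> None:
    dst_params = dict(dst.named_parameters())
    with torch.no_grad():
        for name, p_src in src.named_parameters():
            p_dst = dst_params.get(name)
            if p_dst is None:
                continue           # no counterpart in the new architecture
            idx = tuple(slice(0, min(a, b))
                        for a, b in zip(p_src.shape, p_dst.shape))
            p_dst[idx].copy_(p_src[idx])

subsume(nn.Linear(16, 16), nn.Linear(32, 32))   # toy source -> toy target
```

A Gurukul-style loop can then be sketched as: Sisya proposes a delta, and Guru
accepts it only if it passes constitutional checks. The two checks here (a
bounded update norm, no regression on a held-out batch) are invented stand-ins
for the real constitution.

```python
import torch
import torch.nn as nn

# Hypothetical Gurukul-style proposal-validation loop; all names and
# constraints are illustrative, not the proprietary API.

torch.manual_seed(0)
model = nn.Linear(8, 1)            # toy stand-in for the 2.7B SSM
loss_fn = nn.MSELoss()

def make_batch():
    x = torch.randn(3, 8)          # batch_size=3, as in the card
    return x, x.sum(dim=1, keepdim=True)

def sisya_propose(lr=1e-5):
    """Sisya: propose a weight delta from one gradient step on fresh data."""
    x, y = make_batch()
    grads = torch.autograd.grad(loss_fn(model(x), y), list(model.parameters()))
    return [-lr * g for g in grads]

def apply_delta(delta, sign=1.0):
    with torch.no_grad():
        for p, d in zip(model.parameters(), delta):
            p.add_(sign * d)

def guru_validate(delta, max_norm=1.0):
    """Guru: accept only bounded deltas that don't worsen held-out loss."""
    if torch.sqrt(sum((d ** 2).sum() for d in delta)) > max_norm:
        return False
    x, y = make_batch()            # stands in for a held-out batch
    with torch.no_grad():
        before = loss_fn(model(x), y).item()
        apply_delta(delta)         # trial application...
        after = loss_fn(model(x), y).item()
        apply_delta(delta, -1.0)   # ...rolled back either way
    return after <= before

accepted = 0
for _ in range(200):               # the card reports 1,500 accepted proposals
    delta = sisya_propose()
    if guru_validate(delta):
        apply_delta(delta)
        accepted += 1
print("accepted proposals:", accepted)
```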

**1,500 accepted Gurukul proposals across 6 phases on a single AceCloud L4 (24GB VRAM).
~7 days of effective training time (total elapsed time was higher due to crash
recovery and VRAM-leak debugging).**

| Phase | Proposals | Dataset | Focus |
|-------|-----------|---------|-------|

@@ -81,6 +93,9 @@ SFT imprint applied using surface-only gate-layer fine-tuning (65 examples, 3 ep

**Final checkpoint: Step 1,500.** seq_len=64, batch_size=3, optimizer=Lion, lr=1e-5.
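
For concreteness, here is how that configuration would look with the
open-source `lion-pytorch` implementation of the Lion optimizer; this is an
assumption for illustration, since Rabbit's actual training stack and model
class are unreleased.

```python
import torch.nn as nn
from lion_pytorch import Lion   # pip install lion-pytorch

# Final-checkpoint hyperparameters from the card; nn.Linear is a
# placeholder, since the real 2.7B SSM implementation is private.
SEQ_LEN, BATCH_SIZE = 64, 3
model = nn.Linear(16, 16)
optimizer = Lion(model.parameters(), lr=1e-5)
```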

SFT imprint applied using surface-only gate-layer fine-tuning (65 examples, 3 epochs),
trained with the Anvaya Gurukul protocol.

## Evaluation Results (Step 1,500)

### Internal – Scale-Invariant Metrics

@@ -103,18 +118,19 @@ capability.

### Commercial Benchmarks (lm-eval harness)

> **Standard academic benchmarks are not yet meaningful here.** Rabbit was
> deliberately trained at seq_len=64 as a pure architecture proof. Standard
> lm-eval prompts (few-shot examples + question) run 150–400 tokens, well
> beyond Rabbit's training context. Raccoon (seq_len=512) removes this
> constraint entirely.
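
The context-length mismatch is easy to check with the GPT-NeoX tokenizer the
card pairs with the model; the 5-shot stub below is invented for illustration,
not an actual lm-eval prompt.

```python
from transformers import AutoTokenizer

# Count the tokens of a typical few-shot prompt against Rabbit's
# training context of 64 tokens.
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

shot = ("Question: Which gas do plants absorb from the air?\n"
        "A. oxygen  B. carbon dioxide  C. nitrogen  D. helium\n"
        "Answer: B\n\n")
prompt = shot * 5 + "Question:"

n = len(tok(prompt).input_ids)
print(n, "prompt tokens vs seq_len=64 -> overflow:", n > 64)
```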

| Benchmark | Score | Notes |
|-----------|-------|-------|
| HellaSwag | 25.89% | Prompt exceeds training seq_len |
| ARC-Challenge | 26.71% | Prompt exceeds training seq_len |
| MMLU | 26.89% | Prompt exceeds training seq_len |
| WinoGrande | 48.62% | Prompt exceeds training seq_len |
| TruthfulQA MC1 | 21.91% | Prompt exceeds training seq_len |

## What Comes Next

@@ -125,5 +141,5 @@ capability.

| **Polar Bear** | ~13B | 512 | Planned – STEM + AEVA anti-hallucination layer |

The delta between Rabbit and Raccoon is the story. One epoch → two epochs,
seq_len 64 → 512, 2.7B → 6.1B. Same pipeline, same hardware philosophy.
**Give us more resources and watch what happens.**