full model card rewrite — Rabbit v0.1 Alpha
README.md
tags:
- ssm
- state-space-model
- causal-lm
- rabbit
- rtaforge
- proof-of-concept
base_model: RtaForge/Anvaya-Rabbit-2.7B
---

# Anvaya-Rabbit 2.7B — v0.1 Alpha

**Proof of concept.** Rabbit is the first model in the Anvaya series — a demonstration
that a fully custom State-Space Model (SSM) architecture can be trained from scratch,
on a single GPU, without any dependence on attention or transformer building blocks.

This is not a production model. It is the opening move in a deliberate curriculum:
**Rabbit → Raccoon → Polar Bear.** The architecture, training protocol, and
infrastructure are the story. The benchmarks are a baseline.

## Architecture

- **Type**: Ṛta-SSM v7.2.2, Fortress Unbroken — recurrent SSM, no attention (illustrated below)
- **Parameters**: ~2.78B
- **Layers**: 64
- **d_model / d_state**: 2560
- **Vocabulary**: 50,280 (GPT-NeoX tokenizer)
- **Precision**: bfloat16
- **Training seq_len**: 64

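For readers unfamiliar with state-space layers, the sketch below shows the kind of per-token recurrence that replaces attention in an architecture like this. It is a toy, diagonal-state illustration only; the class name `ToySSMLayer` and its shapes are assumptions chosen to match the table above, not the actual Ṛta-SSM / `white_rabbit` implementation.

```python
# Illustrative only, not the Rabbit source: a minimal diagonal state-space layer.
# Each token updates a running hidden state instead of attending over the past:
#   h_t = A * h_{t-1} + B(x_t)   (state update)
#   y_t = C(h_t)                 (readout)
import torch
import torch.nn as nn


class ToySSMLayer(nn.Module):
    def __init__(self, d_model: int = 2560, d_state: int = 2560):
        super().__init__()
        self.A = nn.Parameter(torch.full((d_state,), 0.9))  # diagonal state decay
        self.B = nn.Linear(d_model, d_state, bias=False)    # input projection
        self.C = nn.Linear(d_state, d_model, bias=False)    # state readout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); cost grows linearly with seq_len
        batch, seq_len, _ = x.shape
        h = x.new_zeros(batch, self.A.shape[0])
        ys = []
        for t in range(seq_len):
            h = self.A * h + self.B(x[:, t])
            ys.append(self.C(h))
        return torch.stack(ys, dim=1)  # (batch, seq_len, d_model)
```

Stacking 64 such layers would mirror the layer count above; the real model adds gating, normalisation, and the Fortress Unbroken specifics, none of which are shown here.
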
## Weights

This repository contains the base pretrained checkpoint
(`base/Anvaya-Rabbit-2.3B-0.1-alpha-base.pt`) and the SFT imprint checkpoint
(`imprint/Anvaya-Rabbit-2.3b-0.1-alpha-imprint.pt`).

Load the imprint weights directly:

```python
from white_rabbit.rabbit_model import create_rabbit_model
from transformers import AutoTokenizer
import torch

model = create_rabbit_model(vocab_size=50280, durga_variant="fu-64")
sd = torch.load("imprint/Anvaya-Rabbit-2.3b-0.1-alpha-imprint.pt", map_location="cpu")
model.load_state_dict(sd, strict=False)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```

> **Requires**: `rtaforge-substrates` — this model uses a custom SSM architecture
> not compatible with standard HuggingFace `AutoModel`.

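The snippet above stops at loading. As a usage illustration, here is a hedged greedy-decoding sketch built on the `model` and `tokenizer` defined there; it assumes `model(input_ids)` returns next-token logits of shape `(batch, seq_len, vocab_size)`, so adjust it if the actual `white_rabbit` forward signature differs.

```python
# Hypothetical usage sketch: greedy decoding with the objects loaded above.
# Assumes model(input_ids) returns logits of shape (batch, seq_len, vocab_size).
import torch

prompt = "State-space models replace attention with"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    while input_ids.shape[1] < 64:            # Rabbit was trained at seq_len=64
        logits = model(input_ids)             # (1, seq_len, 50280)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0]))
```
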
## Training Curriculum

One epoch, single L4 GPU, ~15,000 steps across 8 phases plus a 1,500-step Scholar Sprint.

| Phase | Steps | Dataset | Focus |
|-------|-------|---------|-------|
| 6 | 2,000 | Glaive alignment | Alignment |
| 7 | 1,500 | Glaive alignment | Alignment |

Final Scholar Sprint: 1,500 steps, Phase 5 saturation (Logic Giants corpus).
**Final checkpoint: Step 1,500.**

Trained with the Anvaya Gurukul protocol: a constitutional Sisya/Guru loop
where Sisya proposes weight deltas and Guru applies them after validation.
The SFT imprint was applied using surface-only gate-layer fine-tuning (sketched below).

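The fine-tuning script itself is not part of this repository. As a rough illustration of what surface-only gate-layer fine-tuning means in practice, the sketch below freezes everything in the `model` loaded above except parameters whose names mark them as gates; the `"gate"` substring filter and the optimizer settings are assumptions, not the actual Gurukul tooling.

```python
# Rough sketch of surface-only gate-layer fine-tuning: freeze the bulk of the
# network and leave only gate parameters trainable. The substring match on
# "gate" is a guess at the parameter naming, not the white_rabbit scheme.
import torch

for name, param in model.named_parameters():
    param.requires_grad = "gate" in name                 # gates stay trainable

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)

# One hypothetical SFT step would then look like:
#   logits = model(batch_input_ids)
#   loss = torch.nn.functional.cross_entropy(
#       logits[:, :-1].reshape(-1, logits.size(-1)), batch_input_ids[:, 1:].reshape(-1))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```
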
## Evaluation Results (Step 1,500)

### Internal — Scale-Invariant Metrics

Evaluated using Top-K accuracy and Mean Reciprocal Rank (MRR) vs. a randomly initialised
baseline of identical architecture. 50 samples per corpus, seq_len=64.

| Metric | Random Init | Trained (Step 1,500) | Gain |
|--------|-------------|----------------------|------|
| Top-1 Accuracy (aggregate) | 0.24% | **1.90%** | **~8×** |
| Top-10 Accuracy (aggregate) | 0.24% | **35.84%** | **~149×** |
| MRR (aggregate) | 0.0026 | **0.1724** | **~66×** |
| MRR — Deep Math | 0.0084 | **0.186** | **~22×** |
| Top-10 — Biology | ~1.3% | **~12%** | **~10×** |
| Top-10 — Chemistry | ~1.3% | **~13%** | **~10×** |

These gains are measured against a randomly initialised model of identical
architecture — they reflect what the training curriculum taught, not absolute capability.

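The internal harness is not published, so as a reference for the two metrics above, here is a minimal sketch of next-token Top-K accuracy and MRR. The helper `topk_and_mrr` is hypothetical and assumes `model(input_ids)` returns logits of shape `(batch, seq_len, vocab_size)`; running it on both the trained checkpoint and a freshly initialised copy of the same architecture reproduces the style of comparison in the Gain column.

```python
# Minimal sketch of the scale-invariant metrics: next-token Top-K accuracy and
# Mean Reciprocal Rank. Hypothetical helper; assumes model(input_ids) returns
# logits of shape (batch, seq_len, vocab_size).
import torch


@torch.no_grad()
def topk_and_mrr(model, input_ids: torch.Tensor, k: int = 10):
    logits = model(input_ids)                  # (batch, seq_len, vocab)
    preds = logits[:, :-1]                     # prediction for each next token
    targets = input_ids[:, 1:]                 # the actual next tokens

    # Top-K accuracy: fraction of positions where the true token is in the top K.
    topk_ids = preds.topk(k, dim=-1).indices                 # (batch, seq-1, k)
    topk_acc = (topk_ids == targets.unsqueeze(-1)).any(-1).float().mean()

    # MRR: mean of 1 / rank of the true token when tokens are sorted by logit.
    target_logit = preds.gather(-1, targets.unsqueeze(-1))   # (batch, seq-1, 1)
    rank = (preds > target_logit).sum(-1) + 1                # (batch, seq-1)
    mrr = (1.0 / rank.float()).mean()
    return topk_acc.item(), mrr.item()
```
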
### Commercial Benchmarks (lm-eval harness)

> **Important caveat**: Rabbit was trained at seq_len=64. Standard lm-eval prompts
> (few-shot examples + question) typically run 150–400 tokens. The scores below reflect
> inference at context lengths the model was never trained on.
> Raccoon (seq_len=512) will be evaluated without this constraint.

| Benchmark | Score | Notes |
|-----------|-------|-------|
| HellaSwag | 25.89% | Near-random; context length exceeds training seq_len |
| ARC-Challenge | 26.71% | Near-random; context length exceeds training seq_len |
| MMLU | 26.89% | Near-random; 5-shot prompts well beyond training seq_len |
| WinoGrande | 48.62% | Near-random |
| TruthfulQA MC1 | 21.91% | — |

## What Comes Next

| Model | Params | seq_len | Status |
|-------|--------|---------|--------|
| **Rabbit** | 2.7B | 64 | ✅ This model — v0.1 Alpha |
| **Raccoon** | 2.7B | 512 | In training — reasoning curriculum (math ×2, logic ×2) |
| **Polar Bear** | ~13B | 512 | Planned — STEM + AEVA anti-hallucination layer |

The delta between Rabbit and Raccoon is the story. One epoch → two epochs,
seq_len 64 → 512. Same pipeline, same hardware philosophy.
**Give us more resources and watch what happens.**