full model card rewrite — Rabbit v0.1 Alpha
README.md
tags:
- ssm
- state-space-model
- causal-lm
- rabbit
- rtaforge
- proof-of-concept
base_model: RtaForge/Anvaya-Rabbit-2.7B
---

# Anvaya-Rabbit 2.7B — v0.1 Alpha

**Proof of concept.** Rabbit is the first model in the Anvaya series — a demonstration
that a fully custom State-Space Model (SSM) architecture can be trained from scratch,
on a single GPU, without any dependence on attention or transformer building blocks.

This is not a production model. It is the opening move in a deliberate curriculum:
**Rabbit → Raccoon → Polar Bear.** The architecture, training protocol, and
infrastructure are the story. The benchmarks are a baseline.

## Architecture

- **Type**: Ṛta-SSM v7.2.2, Fortress Unbroken — recurrent SSM, no attention (illustrated below)
- **Parameters**: ~2.78B
- **Layers**: 64
- **d_model / d_state**: 2560
- **Vocabulary**: 50,280 (GPT-NeoX tokenizer)
- **Precision**: bfloat16
- **Training seq_len**: 64

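For readers unfamiliar with state-space layers, the sketch below shows the kind of per-token recurrence that replaces attention in an architecture like this. It is a toy, diagonal-state illustration only; the class name `ToySSMLayer` and its shapes are assumptions chosen to match the table above, not the actual Ṛta-SSM / `white_rabbit` implementation.

```python
# Illustrative only, not the Rabbit source: a minimal diagonal state-space layer.
# Each token updates a running hidden state instead of attending over the past:
#   h_t = A * h_{t-1} + B(x_t)   (state update)
#   y_t = C(h_t)                 (readout)
import torch
import torch.nn as nn


class ToySSMLayer(nn.Module):
    def __init__(self, d_model: int = 2560, d_state: int = 2560):
        super().__init__()
        self.A = nn.Parameter(torch.full((d_state,), 0.9))  # diagonal state decay
        self.B = nn.Linear(d_model, d_state, bias=False)    # input projection
        self.C = nn.Linear(d_state, d_model, bias=False)    # state readout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); cost grows linearly with seq_len
        batch, seq_len, _ = x.shape
        h = x.new_zeros(batch, self.A.shape[0])
        ys = []
        for t in range(seq_len):
            h = self.A * h + self.B(x[:, t])
            ys.append(self.C(h))
        return torch.stack(ys, dim=1)  # (batch, seq_len, d_model)
```

Stacking 64 such layers would mirror the layer count above; the real model adds gating, normalisation, and the Fortress Unbroken specifics, none of which are shown here.
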
## Weights

This repository contains the base pretrained checkpoint
(`base/Anvaya-Rabbit-2.3B-0.1-alpha-base.pt`) and the SFT imprint checkpoint
(`imprint/Anvaya-Rabbit-2.3b-0.1-alpha-imprint.pt`).

Load the imprint weights directly:

```python
from white_rabbit.rabbit_model import create_rabbit_model
from transformers import AutoTokenizer
import torch

model = create_rabbit_model(vocab_size=50280, durga_variant="fu-64")
sd = torch.load("imprint/Anvaya-Rabbit-2.3b-0.1-alpha-imprint.pt", map_location="cpu")
model.load_state_dict(sd, strict=False)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```

> **Requires**: `rtaforge-substrates` — this model uses a custom SSM architecture
> not compatible with standard HuggingFace `AutoModel`.

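The snippet above stops at loading. As a usage illustration, here is a hedged greedy-decoding sketch built on the `model` and `tokenizer` defined there; it assumes `model(input_ids)` returns next-token logits of shape `(batch, seq_len, vocab_size)`, so adjust it if the actual `white_rabbit` forward signature differs.

```python
# Hypothetical usage sketch: greedy decoding with the objects loaded above.
# Assumes model(input_ids) returns logits of shape (batch, seq_len, vocab_size).
import torch

prompt = "State-space models replace attention with"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    while input_ids.shape[1] < 64:            # Rabbit was trained at seq_len=64
        logits = model(input_ids)             # (1, seq_len, 50280)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0]))
```
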
## Training Curriculum

One epoch, single L4 GPU, ~15,000 steps across 8 phases plus a 1,500-step Scholar Sprint.

| Phase | Steps | Dataset | Focus |
|-------|-------|---------|-------|
| 6 | 2,000 | Glaive alignment | Alignment |
| 7 | 1,500 | Glaive alignment | Alignment |

Final Scholar Sprint: 1,500 steps, Phase 5 saturation (Logic Giants corpus).
**Final checkpoint: Step 1,500.**

Trained with the Anvaya Gurukul protocol: a constitutional Sisya/Guru loop
where Sisya proposes weight deltas and Guru applies them after validation.
The SFT imprint was applied using surface-only gate-layer fine-tuning (sketched below).

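The fine-tuning script itself is not part of this repository. As a rough illustration of what surface-only gate-layer fine-tuning means in practice, the sketch below freezes everything in the `model` loaded above except parameters whose names mark them as gates; the `"gate"` substring filter and the optimizer settings are assumptions, not the actual Gurukul tooling.

```python
# Rough sketch of surface-only gate-layer fine-tuning: freeze the bulk of the
# network and leave only gate parameters trainable. The substring match on
# "gate" is a guess at the parameter naming, not the white_rabbit scheme.
import torch

for name, param in model.named_parameters():
    param.requires_grad = "gate" in name                 # gates stay trainable

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)

# One hypothetical SFT step would then look like:
#   logits = model(batch_input_ids)
#   loss = torch.nn.functional.cross_entropy(
#       logits[:, :-1].reshape(-1, logits.size(-1)), batch_input_ids[:, 1:].reshape(-1))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```
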
## Evaluation Results (Step 1,500)

### Internal — Scale-Invariant Metrics

Evaluated using Top-K accuracy and Mean Reciprocal Rank (MRR) vs. a randomly initialised
baseline of identical architecture. 50 samples per corpus, seq_len=64.

| Metric | Random Init | Trained (Step 1,500) | Gain |
|--------|-------------|----------------------|------|
| Top-1 Accuracy (aggregate) | 0.24% | **1.90%** | **~8×** |
| Top-10 Accuracy (aggregate) | 0.24% | **35.84%** | **~149×** |
| MRR (aggregate) | 0.0026 | **0.1724** | **~66×** |
| MRR — Deep Math | 0.0084 | **0.186** | **~22×** |
| Top-10 — Biology | ~1.3% | **~12%** | **~10×** |
| Top-10 — Chemistry | ~1.3% | **~13%** | **~10×** |

These gains are measured against a randomly initialised model of identical
architecture — they reflect what the training curriculum taught, not absolute capability.

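The internal harness is not published, so as a reference for the two metrics above, here is a minimal sketch of next-token Top-K accuracy and MRR. The helper `topk_and_mrr` is hypothetical and assumes `model(input_ids)` returns logits of shape `(batch, seq_len, vocab_size)`; running it on both the trained checkpoint and a freshly initialised copy of the same architecture reproduces the style of comparison in the Gain column.

```python
# Minimal sketch of the scale-invariant metrics: next-token Top-K accuracy and
# Mean Reciprocal Rank. Hypothetical helper; assumes model(input_ids) returns
# logits of shape (batch, seq_len, vocab_size).
import torch


@torch.no_grad()
def topk_and_mrr(model, input_ids: torch.Tensor, k: int = 10):
    logits = model(input_ids)                  # (batch, seq_len, vocab)
    preds = logits[:, :-1]                     # prediction for each next token
    targets = input_ids[:, 1:]                 # the actual next tokens

    # Top-K accuracy: fraction of positions where the true token is in the top K.
    topk_ids = preds.topk(k, dim=-1).indices                 # (batch, seq-1, k)
    topk_acc = (topk_ids == targets.unsqueeze(-1)).any(-1).float().mean()

    # MRR: mean of 1 / rank of the true token when tokens are sorted by logit.
    target_logit = preds.gather(-1, targets.unsqueeze(-1))   # (batch, seq-1, 1)
    rank = (preds > target_logit).sum(-1) + 1                # (batch, seq-1)
    mrr = (1.0 / rank.float()).mean()
    return topk_acc.item(), mrr.item()
```
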
### Commercial Benchmarks (lm-eval harness)

> **Important caveat**: Rabbit was trained at seq_len=64. Standard lm-eval prompts
> (few-shot examples + question) typically run 150–400 tokens. The scores below reflect
> inference at context lengths the model was never trained on.
> Raccoon (seq_len=512) will be evaluated without this constraint.

| Benchmark | Score | Notes |
|-----------|-------|-------|
| HellaSwag | 25.89% | Near-random; context length exceeds training seq_len |
| ARC-Challenge | 26.71% | Near-random; context length exceeds training seq_len |
| MMLU | 26.89% | Near-random; 5-shot prompts well beyond training seq_len |
| WinoGrande | 48.62% | Near-random |
| TruthfulQA MC1 | 21.91% | — |

## What Comes Next

| Model | Params | seq_len | Status |
|-------|--------|---------|--------|
| **Rabbit** | 2.7B | 64 | ✅ This model — v0.1 Alpha |
| **Raccoon** | 2.7B | 512 | In training — reasoning curriculum (math ×2, logic ×2) |
| **Polar Bear** | ~13B | 512 | Planned — STEM + AEVA anti-hallucination layer |

The delta between Rabbit and Raccoon is the story. One epoch → two epochs,
seq_len 64 → 512. Same pipeline, same hardware philosophy.
**Give us more resources and watch what happens.**