fix: consistent 2.7B naming, reviewer fixes, rtaforge-substrates note
README.md CHANGED
@@ -14,18 +14,19 @@ base_model: RtaForge/Anvaya-Rabbit-2.7B
14
15   # Anvaya-Rabbit 2.7B - v0.1 Alpha
16
17 - **
18 -
19 -
20
21   This is not a production model. It is the opening move in a deliberate curriculum:
22 - **Rabbit → Raccoon → Polar Bear.** The
23 - infrastructure are the story. The benchmarks are a baseline.
24
25   ## Architecture
26
27   - **Type**: Ṛta-SSM v7.2.2, Fortress Unbroken - recurrent SSM, no attention
28 - - **Parameters**: ~2.
29   - **Layers**: 64
30   - **d_model / d_state**: 2560
31   - **Vocabulary**: 50,280 (GPT-NeoX tokenizer)
@@ -35,30 +36,35 @@ infrastructure are the story. The benchmarks are a baseline.
35   ## Weights
36
37   This repository contains the base pretrained checkpoint
38 - (`base/Anvaya-Rabbit-2.
39 - (`imprint/Anvaya-Rabbit-2.
40
41 - Load the imprint weights
42
43   ```python
44   from white_rabbit.rabbit_model import create_rabbit_model
45   from transformers import AutoTokenizer
46   import torch
47
48 - model = create_rabbit_model(
49 -
50   model.load_state_dict(sd, strict=False)
51   model.eval()
52
53   tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
54   ```
55
56 - > **Requires**: `rtaforge-substrates`
57   > not compatible with standard HuggingFace `AutoModel`.
58
59   ## Training Curriculum
60
61 - One epoch, single L4, ~15,000 steps across 8 phases + 1,500-step Scholar Sprint.
62
63   | Phase | Steps | Dataset | Focus |
64   |-------|-------|---------|-------|
@@ -89,13 +95,14 @@ baseline of identical architecture. 50 samples per corpus, seq_len=64.
89   | Top-10 - Chemistry | ~1.3% | **~13%** | **~10×** |
90
91   These gains are measured against a randomly initialised model of identical
92 - architecture - they reflect what the training curriculum taught, not absolute
93
94   ### Commercial Benchmarks (lm-eval harness)
95
96   > **Important caveat**: Rabbit was trained at seq_len=64. Standard lm-eval prompts
97 - > (few-shot examples + question) typically run 150-400 tokens. The scores below
98 - > inference at context lengths the model was never trained on.
99   > Raccoon (seq_len=512) will be evaluated without this constraint.
100
101   | Benchmark | Score | Notes |
@@ -110,8 +117,8 @@ architecture - they reflect what the training curriculum taught, not absolute
110
111   | Model | Params | seq_len | Status |
112   |-------|--------|---------|--------|
113 - | **Rabbit** | 2.7B | 64 | ✅ This model - v0.1 Alpha |
114 - | **Raccoon** | 2.7B | 512 | In training - reasoning curriculum (math ×2, logic ×2) |
115   | **Polar Bear** | ~13B | 512 | Planned - STEM + AEVA anti-hallucination layer |
116
117   The delta between Rabbit and Raccoon is the story. One epoch → two epochs,

14
15   # Anvaya-Rabbit 2.7B - v0.1 Alpha
16
17 + **The architecture, training protocol, and infrastructure are the story.**
18 + Rabbit is the first model in the Anvaya series - a proof of concept demonstrating
19 + that a fully custom State-Space Model (SSM) can be trained from scratch, on a
20 + single consumer-grade GPU, with no dependence on attention or transformer
21 + building blocks.
22
23   This is not a production model. It is the opening move in a deliberate curriculum:
24 + **Rabbit → Raccoon → Polar Bear.** The benchmarks below are a baseline, not a claim.
25
26   ## Architecture
27
28   - **Type**: Ṛta-SSM v7.2.2, Fortress Unbroken - recurrent SSM, no attention
29 + - **Parameters**: ~2.7B (post-subsumination)
30   - **Layers**: 64
31   - **d_model / d_state**: 2560
32   - **Vocabulary**: 50,280 (GPT-NeoX tokenizer)

36   ## Weights
37
38   This repository contains the base pretrained checkpoint
39 + (`base/Anvaya-Rabbit-2.7B-0.1-alpha-base.pt`) and the SFT imprint checkpoint
40 + (`imprint/Anvaya-Rabbit-2.7B-0.1-alpha-imprint.pt`).
41
42 + Load the imprint weights (base + SFT overlay, recommended for inference):
43
44   ```python
45   from white_rabbit.rabbit_model import create_rabbit_model
46   from transformers import AutoTokenizer
47   import torch
48
49 + model = create_rabbit_model(
50 +     vocab_size=50280,
51 +     durga_variant="fu-64",  # 64-layer Fortress Unbroken backbone
52 + )
53 + sd = torch.load("imprint/Anvaya-Rabbit-2.7B-0.1-alpha-imprint.pt", map_location="cpu")
54   model.load_state_dict(sd, strict=False)
55   model.eval()
56
57   tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
58   ```
59
60 + > **Requires**: `rtaforge-substrates` (private repository - contact
61 + > guha@rtaforge.in for access). This model uses a custom SSM architecture
62   > not compatible with standard HuggingFace `AutoModel`.
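
Because the loading snippet passes `strict=False`, key mismatches between the model and the checkpoint are tolerated rather than raised. PyTorch's `load_state_dict` reports them in its return value as `missing_keys` and `unexpected_keys`. The sketch below reproduces that comparison with plain dicts so it runs without the private `rtaforge-substrates` package; the helper name and all key names are hypothetical and are not taken from the Rabbit checkpoint.

```python
def compare_state_dict_keys(expected_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, sorted for stable output."""
    expected = set(expected_keys)
    loaded = set(checkpoint_keys)
    missing = sorted(expected - loaded)     # parameters the model has but the file lacks
    unexpected = sorted(loaded - expected)  # entries in the file the model does not use
    return missing, unexpected


# Hypothetical key names, for illustration only.
model_keys = ["embed.weight", "layers.0.ssm.A", "lm_head.weight"]
checkpoint = {"embed.weight": 0, "layers.0.ssm.A": 0, "sft_overlay.scale": 0}

missing, unexpected = compare_state_dict_keys(model_keys, checkpoint.keys())
print("missing:", missing)
print("unexpected:", unexpected)
```

With the real model, the equivalent check would be to keep the return value, `result = model.load_state_dict(sd, strict=False)`, and inspect `result.missing_keys` and `result.unexpected_keys` before running inference.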
63
64   ## Training Curriculum
65
66 + One epoch, single NVIDIA L4, ~15,000 steps across 8 phases + 1,500-step Scholar Sprint.
67 + Phases 1-5 (pretraining corpus progression) not shown.
68
69   | Phase | Steps | Dataset | Focus |
70   |-------|-------|---------|-------|

95   | Top-10 - Chemistry | ~1.3% | **~13%** | **~10×** |
96
97   These gains are measured against a randomly initialised model of identical
98 + architecture - they reflect what the training curriculum taught, not absolute
99 + capability.
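
The gain figure in the Top-10 Chemistry row appears to be the trained score divided by the random-init baseline score. A minimal sketch of that arithmetic follows, assuming the three numeric columns are baseline, trained, and gain (the column headers fall outside this hunk); the function name is illustrative.

```python
def relative_gain(baseline_pct, trained_pct):
    """Ratio of the trained score to the random-init baseline score."""
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive to form a ratio")
    return trained_pct / baseline_pct


# Approximate figures from the Top-10 Chemistry row.
gain = relative_gain(baseline_pct=1.3, trained_pct=13.0)
print(f"~{gain:.0f}x over the random-init baseline")
```

A ratio like this says how far training moved the model from chance-level behaviour; it says nothing about how the trained score compares with other models.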
100
101   ### Commercial Benchmarks (lm-eval harness)
102
103   > **Important caveat**: Rabbit was trained at seq_len=64. Standard lm-eval prompts
104 + > (few-shot examples + question) typically run 150-400 tokens. The scores below
105 + > reflect inference at context lengths the model was never trained on.
106   > Raccoon (seq_len=512) will be evaluated without this constraint.
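
To make the size of that mismatch concrete, a prompt's token count can be compared with the training length. This is a sketch under stated assumptions: `TRAIN_SEQ_LEN` comes from the caveat, while the helper name and the two example token counts are hypothetical. A real check would count tokens with the GPT-NeoX tokenizer rather than use fixed numbers.

```python
TRAIN_SEQ_LEN = 64  # Rabbit's training sequence length, per the caveat


def context_overrun(num_prompt_tokens, train_seq_len=TRAIN_SEQ_LEN):
    """Return how many times longer the prompt is than the training context."""
    if num_prompt_tokens < 0:
        raise ValueError("token count cannot be negative")
    return num_prompt_tokens / train_seq_len


# Hypothetical prompt lengths at the ends of the cited 150-400 token range.
for n_tokens in (150, 400):
    ratio = context_overrun(n_tokens)
    status = "beyond" if ratio > 1 else "within"
    print(f"{n_tokens} tokens: {ratio:.2f}x the training length ({status} trained context)")
```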
107
108   | Benchmark | Score | Notes |

117
118   | Model | Params | seq_len | Status |
119   |-------|--------|---------|--------|
120 + | **Rabbit** | ~2.7B | 64 | ✅ This model - v0.1 Alpha |
121 + | **Raccoon** | ~2.7B | 512 | In training - reasoning curriculum (math ×2, logic ×2) |
122   | **Polar Bear** | ~13B | 512 | Planned - STEM + AEVA anti-hallucination layer |
123
124   The delta between Rabbit and Raccoon is the story. One epoch → two epochs,