---
license: apache-2.0
language:
- en
- zh
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- BlinkDL/rwkv-7-world
pipeline_tag: text-generation
library_name: transformers
---

+
<div align="center">
|
| 14 |
+
<img src="./figures/banner.jpg" style="border-radius: 10px; width: 100%; height: 100%; object-fit: cover; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="ARWKV" />
|
| 15 |
+
</div>
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
<h1 align="center">ARWKV🪿</h1>
|
| 19 |
+
|
| 20 |
+
<p align="center">
|
| 21 |
+
<a href="https://arxiv.org/abs/2501.15570"><b>Paper Link</b>👁️</a> | <a href="https://github.com/yynil/RWKVInside"><b>Github</b>✅</a>
|
| 22 |
+
</p>
|
| 23 |
+
|
# ARWKV-7B-GATE-MLP (Preview 0.1)

<img src="./figures/architecture.png" alt="ARWKV Hybrid Architecture" width="30%">

*Preview version with **RWKV-7** time mixing and a Transformer MLP*

## 📌 Overview

**ALL YOU NEED IS RWKV**

This is an **early preview** of our 7B-parameter hybrid RNN-Transformer model, trained at a 2k context length **(only stage 2 applied, without SFT or DPO)** through 3-stage knowledge distillation from DeepSeek-R1-Distill-Qwen-1.5B. While this is a foundational version, it already demonstrates:

- ✅ RWKV-7's efficient recurrence mechanism
- ✅ No self-attention, fully O(n)
- ✅ Constant VRAM usage
- ✅ Single-GPU trainability

**Roadmap Notice**: We will soon open-source enhanced versions with:
- 🚀 16k+ context capability
- 🧮 Math-specific improvements
- 📚 An RL-enhanced reasoning model

## How to use

```shell
pip3 install --upgrade rwkv-fla transformers
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "RWKV-Red-Team/ARWKV-R1-1B5",
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "RWKV-Red-Team/ARWKV-R1-1B5"
)
```
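
Once the model and tokenizer are loaded, generation follows the standard `transformers` API. The snippet below is a minimal sketch (it assumes the repository ships a chat template; the prompt and sampling parameters are only illustrative):

```python
# Minimal generation sketch; prompt and sampling settings are examples, not recommendations.
messages = [{"role": "user", "content": "Explain the RWKV time-mixing idea in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```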

## 🔑 Key Features

| Component | Specification | Note |
|-----------|---------------|------|
| Architecture | RWKV-7 TimeMix + SwiGLU | Hybrid design |
| Context Window | 2048 tokens (training CTX) | *Preview limitation* |
| Training Tokens | 40M | Distillation-focused |
| Precision | FP16 inference recommended (16 GB VRAM required) | 15%↑ vs BF16 |

## 🏗️ Architecture Highlights

### Core Modification Flow

```diff
Transformer Decoder Layer:
- Multi-head Latent Attention (MLA)
+ RWKV-7 Time Mixing (Eq. 3)
- RoPE Positional Encoding
+ State Recurrence
= Hybrid Layer Output
```
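
In code terms, the modification keeps the outer structure of a decoder layer (pre-norm, residual connections, SwiGLU MLP) and swaps the attention branch for a recurrent time-mixing branch that carries a fixed-size state instead of a growing KV cache. The sketch below is purely illustrative: `SimpleTimeMix` is a toy linear recurrence standing in for the real RWKV-7 kernel, and all module names and sizes are invented for the example.

```python
# Illustrative only: the shape of a hybrid decoder layer in the spirit of the diff above.
# SimpleTimeMix is a toy stand-in for the RWKV-7 time-mixing kernel, not the real thing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleTimeMix(nn.Module):
    """Toy per-channel linear recurrence: fixed-size state, O(n) in sequence length."""
    def __init__(self, dim: int):
        super().__init__()
        self.receptance = nn.Linear(dim, dim, bias=False)
        self.key = nn.Linear(dim, dim, bias=False)
        self.value = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.decay = nn.Parameter(torch.full((dim,), 0.9))  # learned forgetting factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, seq, dim)
        r = torch.sigmoid(self.receptance(x))
        k, v = self.key(x), self.value(x)
        state = torch.zeros_like(x[:, 0])                     # constant-size state, no KV cache
        outs = []
        for t in range(x.size(1)):
            state = self.decay * state + k[:, t] * v[:, t]    # state recurrence
            outs.append(r[:, t] * state)
        return self.out(torch.stack(outs, dim=1))

class SwiGLU(nn.Module):
    """Gated MLP branch, kept unchanged from the Transformer layer."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class HybridDecoderLayer(nn.Module):
    """Pre-norm residual layer: time mixing replaces self-attention + RoPE."""
    def __init__(self, dim: int = 256, hidden: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)   # the real model uses RMSNorm
        self.norm2 = nn.LayerNorm(dim)
        self.time_mix = SimpleTimeMix(dim)
        self.mlp = SwiGLU(dim, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.time_mix(self.norm1(x))   # no attention matrix, no positional encoding
        return x + self.mlp(self.norm2(x))

layer = HybridDecoderLayer()
print(layer(torch.randn(1, 8, 256)).shape)     # torch.Size([1, 8, 256])
```

Because the per-step state has a fixed size, memory stays constant with respect to sequence length, which is where the "constant VRAM" and O(n) claims in the overview come from.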