---
license: apache-2.0
language:
- en
- zh
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- BlinkDL/rwkv-7-world
pipeline_tag: text-generation
library_name: transformers
---

<div align="center">
<img src="./figures/banner.jpg" style="border-radius: 10px; width: 100%; height: 100%; object-fit: cover; box-shadow: 10px 10px 20px rgba(0, 0, 0, 0.5); border: 2px solid white;" alt="ARWKV" />
</div>

<h1 align="center">ARWKV🪿</h1>

<p align="center">
<a href="https://arxiv.org/abs/2501.15570"><b>Paper Link</b>👁️</a> | <a href="https://github.com/yynil/RWKVInside"><b>Github</b>✅</a>
</p>

# ARWKV-7B-GATE-MLP (Preview 0.1)

<img src="./figures/architecture.png" alt="ARWKV Hybrid Architecture" width="30%">

*Preview version with **RWKV-7** time mixing and a Transformer MLP*
## 📌 Overview

**ALL YOU NEED IS RWKV**

This is an **early preview** of our 7B-parameter hybrid RNN-Transformer model, trained with a 2k context length **(only stage 2 applied, without SFT or DPO)** through 3-stage knowledge distillation from DeepSeek-R1-Distill-Qwen-1.5B. Although it is a foundational version, it already demonstrates:

- ✅ RWKV-7's efficient recurrence mechanism
- ✅ No self-attention, fully O(n)
- ✅ Constant VRAM usage
- ✅ Single-GPU trainability

**Roadmap notice**: we will soon open-source enhanced versions with:
- 🚀 16k+ context capability
- 🧮 Math-specific improvements
- 📚 RL-enhanced reasoning

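To see why recurrence gives constant VRAM and O(n) time, the idea can be sketched as a toy linear recurrence in NumPy. This is an illustrative stand-in, not the actual RWKV-7 kernel (the real time mixing has learned decays, gates, and an in-context delta rule); every name below is made up for the example.

```python
import numpy as np

def time_mix_sketch(q, k, v, w):
    """Toy state recurrence in the spirit of RWKV-7 time mixing.

    q, k, v, w: (T, D) arrays. The whole past is compressed into a single
    (D, D) state matrix, so memory stays constant in sequence length and
    total work is O(T) -- no attention over previous tokens.
    """
    T, D = q.shape
    S = np.zeros((D, D))          # recurrent state, fixed size
    out = np.empty((T, D))
    for t in range(T):
        S = w[t][:, None] * S + np.outer(v[t], k[t])  # decay, then write
        out[t] = S @ q[t]                             # read with the query
    return out

rng = np.random.default_rng(0)
T, D = 6, 4
out = time_mix_sketch(rng.normal(size=(T, D)), rng.normal(size=(T, D)),
                      rng.normal(size=(T, D)), rng.uniform(0.9, 1.0, (T, D)))
print(out.shape)  # (6, 4)
```

Note that the state `S` never grows with `T`, which is exactly why the model's VRAM usage stays flat at inference time.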
## How to use
```shell
pip3 install --upgrade rwkv-fla transformers
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "RWKV-Red-Team/ARWKV-R1-1B5",
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "RWKV-Red-Team/ARWKV-R1-1B5"
)
```

## 🔑 Key Features
| Component | Specification | Note |
|-----------|---------------|------|
| Architecture | RWKV-7 TimeMix + SwiGLU | Hybrid design |
| Context Window | 2048-token training context | *Preview limitation* |
| Training Tokens | 40M | Distillation-focused |
| Precision | FP16 inference recommended (16 GB VRAM required) | 15%↑ vs BF16 |

## 🏗️ Architecture Highlights
### Core Modification Flow
```diff
Transformer Decoder Layer:
- Multi-head Latent Attention (MLA)
+ RWKV-7 Time Mixing (Eq. 3)
- RoPE Positional Encoding
+ State Recurrence
= Hybrid Layer Output
```
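
The modification flow above can be mocked up as a pre-norm residual layer in which the time-mixing step replaces attention and RoPE while the SwiGLU MLP is kept. This is a shape-level sketch only: the time mixer is stubbed out, and none of the function or weight names come from the actual codebase.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # The Transformer part that is kept: SwiGLU gated feed-forward.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def hybrid_layer(x, time_mix, mlp_weights):
    # Pre-norm residual layer: RWKV-7 time mixing stands in for
    # attention + RoPE, then SwiGLU channel mixing, as in the diff above.
    h = x + time_mix(rms_norm(x))
    return h + swiglu_mlp(rms_norm(h), *mlp_weights)

rng = np.random.default_rng(0)
T, D, F = 4, 8, 16
x = rng.normal(size=(T, D))
mlp_w = (rng.normal(size=(D, F)), rng.normal(size=(D, F)), rng.normal(size=(F, D)))
stub_mix = lambda h: 0.5 * np.roll(h, 1, axis=0)  # placeholder for time mixing
y = hybrid_layer(x, stub_mix, mlp_w)
print(y.shape)  # (4, 8)
```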
+ ```