# DFlash-LoRA-Inject Evaluation: Accepted Length & Accuracy
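## Why not evaluate online with sglang?
DFlash-LoRA-Inject inference has to **inject the target model's hidden states into the draft model layer by layer**,
which is the core mechanism LoRA-Inject is trained with. sglang does not support this inference mode:
| sglang algorithm | Problem |
|---|---|
| `STANDALONE` | Runs the draft as an independent autoregressive model and **ignores layer injection entirely**. The merged model then behaves like the original Qwen3-8B: accept_length stays at ≈ 4.7 regardless of whether the LoRA was trained |
| `DFLASH` | Expects the DFlash-b16 architecture (5 layers + fc + hidden_norm), which does not match the LoRA-Inject structure (full 36-layer model) |
Evaluation therefore has to be done **offline**: load both the target and the draft model, and implement the speculative decoding loop with layer injection by hand.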
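
To make the metric concrete, here is a minimal sketch of how the accepted length of a single verify round could be counted under greedy verification. It does not model layer injection itself, and the counting convention (matched draft tokens plus one token from the target) is an assumption chosen to be consistent with the statement that `1.0` means the draft contributes nothing. The function names are hypothetical and are not taken from `eval_dflash_lora_inject.py`.

```python
from typing import List, Sequence


def accepted_length(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Tokens emitted by one verify round under greedy verification.

    draft_tokens:  the k tokens proposed by the draft model.
    target_tokens: the k + 1 tokens the target model picks at the same
                   positions (one per draft position, plus one bonus token).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    matched = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        matched += 1
    # The target's own token at the first mismatch (or the bonus token when
    # every draft token matched) is always emitted, so the minimum is 1.
    return matched + 1


def average_accept_length(per_round: List[int]) -> float:
    """Mean accepted length over all verify rounds."""
    return sum(per_round) / len(per_round)


rounds = [
    accepted_length([5, 6, 7], [5, 6, 9, 1]),  # two matches, then a mismatch
    accepted_length([1, 2], [9, 9, 9]),        # immediate mismatch
    accepted_length([1, 2], [1, 2, 3]),        # everything matches
]
print(rounds, average_accept_length(rounds))
```

Under this convention a draft that never matches still yields one token per round, which is why the average cannot drop below 1.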
---
## Basic information
| Item | Path / value |
|---|---|
| conda environment | `spec` |
| Base model (target) | `/workspace/models/Qwen3-8B` |
| Training output (final ckpt) | `.../outputs/qwen3-8b-dflash-lora-inject/epoch_3_step_1400` |
| Merged draft model | `.../outputs/qwen3-8b-dflash-lora-inject-merged` |
| Evaluation script | `/workspace/hanrui/syxin_old/eval_dflash_lora_inject.py` |
| Local datasets | `/workspace/hanrui/datasets/{humaneval,mtbench,gsm8k}` |
| Results output directory | `/workspace/hanrui/syxin_old/Specforge/benchmarks/results/` |
| GPU | 8 × H100 80GB (a single GPU is enough; loading two 8B models needs ~32GB) |
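
The ~32GB figure follows from simple arithmetic on the weights alone. The sketch below assumes a round 8 billion parameters per model and 2 bytes per bf16 parameter; it ignores the KV cache and activations, which come on top. The helper name is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2, num_models: int = 1) -> float:
    """Approximate memory for model weights in decimal gigabytes."""
    return num_params * bytes_per_param * num_models / 1e9


# Target + draft, both bf16, assuming roughly 8e9 parameters each.
both_models_gb = weight_memory_gb(8e9, bytes_per_param=2, num_models=2)
print(both_models_gb)
```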
---
## Step 1: Merge the LoRA weights
LoRA-Inject training saves only the adapter weights, while evaluation needs a full model.
```bash
conda activate spec
python3 -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, os
BASE = '/workspace/models/Qwen3-8B'
ADAPTER = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject/epoch_3_step_1400'
MERGED = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject-merged'
if os.path.exists(MERGED):
    print(f'[skip] Merged model already exists: {MERGED}')
else:
    print('[1/4] Loading base model to CPU ...')
    model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map='cpu')
    print('[2/4] Loading LoRA adapter ...')
    model = PeftModel.from_pretrained(model, ADAPTER)
    print('[3/4] Merging weights ...')
    model = model.merge_and_unload()
    print('[4/4] Saving merged model ...')
    os.makedirs(MERGED, exist_ok=True)
    model.save_pretrained(MERGED, safe_serialization=True)
    AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)
    print(f'Done. Merged model saved to: {MERGED}')
"
```
> Takes about 3–5 minutes and uses ≈ 16 GB of CPU memory. Skipped automatically if the merged model already exists.
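
Conceptually, merging a standard LoRA layer folds the low-rank update into the base weight: `W_merged = W + (lora_alpha / r) * B @ A`. The sketch below illustrates that arithmetic on tiny pure-Python matrices. It is an illustration of the formula only, not of PEFT's internal implementation, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, lora_alpha: float, r: int) -> Matrix:
    """Return W + (lora_alpha / r) * B @ A for one linear layer.

    w:      (out x in) base weight
    lora_a: (r x in)   down-projection
    lora_b: (out x r)  up-projection
    """
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scaling * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # r = 1, in = 2
lora_b = [[1.0], [0.0]]  # out = 2, r = 1
merged = merge_lora(w, lora_a, lora_b, lora_alpha=2.0, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged directory can be loaded like an ordinary checkpoint.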
---
## Step 2: Evaluate accepted length offline
**No sglang server needs to be running**; run the script directly:
### All benchmarks (recommended)
```bash
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh
```
### Single benchmark / quick test
```bash
# HumanEval only
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh humaneval
# Quick test (20 samples per benchmark)
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh --quick
# Specific checkpoint
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh --ckpt epoch_0_step_1000
# Combined
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh humaneval gsm8k --quick
```
### Or call Python directly
```bash
conda activate spec
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
--benchmarks humaneval mtbench gsm8k \
--block-size 16 \
--max-new-tokens 512 \
--temperature 0.0
```
---
## Result files
Results are saved under `results/`. Example file name:
```
dflash_lora_inject_offline_epoch_3_step_1400_20260314_150000.json
```
```json
{
  "model": "dflash-lora-inject/epoch_3_step_1400",
  "block_size": 16,
  "humaneval": {
    "avg_accept_length": 3.42,
    "total_tokens": 28500,
    "latency": 120.5,
    "throughput": 236.5,
    "num_samples": 164,
    "num_verify_rounds": 8320
  },
  "mtbench": { ... },
  "gsm8k": { ... }
}
```
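| Field | Meaning |
|---|---|
| `avg_accept_length` | **Key metric**: average number of tokens accepted per verify round (with injection). Higher is better; `1.0` means the draft contributes nothing |
| `total_tokens` | Total number of generated tokens |
| `throughput` | tokens/s (offline evaluation, without batching optimizations) |
| `num_verify_rounds` | Total number of verify rounds |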
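
The numbers in the example result file appear to be related to each other: `total_tokens / num_verify_rounds` is close to `avg_accept_length`, and `total_tokens / latency` is close to `throughput`, which would make `latency` a duration in seconds. These relationships are inferred from the sample values only and are not confirmed against the evaluation script. A small sanity check along those lines, with a hypothetical helper name:

```python
import json

# Values copied from the example result file shown above.
EXAMPLE_RESULT = json.loads("""
{
  "humaneval": {
    "avg_accept_length": 3.42,
    "total_tokens": 28500,
    "latency": 120.5,
    "throughput": 236.5,
    "num_samples": 164,
    "num_verify_rounds": 8320
  }
}
""")


def derived_metrics(entry: dict) -> dict:
    """Recompute per-round and per-second token rates from the raw counters."""
    return {
        "tokens_per_round": entry["total_tokens"] / entry["num_verify_rounds"],
        "tokens_per_second": entry["total_tokens"] / entry["latency"],
    }


metrics = derived_metrics(EXAMPLE_RESULT["humaneval"])
print(metrics)
```

If the recomputed values drift far from the reported ones in a real result file, the assumed relationships do not hold for that run and the script's own definitions should be checked.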
---
## Comparing against a baseline
To measure the accept_length of the original Qwen3-8B, without LoRA training, used as the draft:
```bash
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
--merged-path /workspace/models/Qwen3-8B \
--benchmarks humaneval mtbench gsm8k \
--num-samples 50
```
> This uses the original Qwen3-8B as both target and draft (with injection),
> so you can check whether LoRA training improves accept_length.
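
Once both runs have produced result files, the comparison comes down to a per-benchmark difference of `avg_accept_length`. The sketch below assumes the result layout shown in the example JSON (one object per benchmark alongside scalar top-level keys). The helper name is hypothetical and the numbers are placeholders for illustration, not measured results.

```python
from typing import Dict


def accept_length_gain(baseline: dict, trained: dict) -> Dict[str, float]:
    """Per-benchmark difference: trained avg_accept_length minus baseline."""
    gains = {}
    for bench, entry in trained.items():
        base_entry = baseline.get(bench)
        # Skip scalar keys such as "model" and "block_size", and benchmarks
        # that are missing from the baseline run.
        if not isinstance(entry, dict) or not isinstance(base_entry, dict):
            continue
        gains[bench] = entry["avg_accept_length"] - base_entry["avg_accept_length"]
    return gains


# Placeholder values, NOT real measurements.
baseline = {
    "model": "baseline",
    "block_size": 16,
    "humaneval": {"avg_accept_length": 3.0},
    "gsm8k": {"avg_accept_length": 2.5},
}
trained = {
    "model": "trained",
    "block_size": 16,
    "humaneval": {"avg_accept_length": 3.5},
    "gsm8k": {"avg_accept_length": 2.25},
}
gains = accept_length_gain(baseline, trained)
print(gains)
```

In practice the two dictionaries would come from `json.load` on the two result files; a positive value means the trained draft is accepted for more tokens per verify round.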
---
## Evaluating other checkpoints
```bash
# Option 1: load the adapter directly (merged on the fly, not saved)
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
--ckpt epoch_0_step_1000 \
--benchmarks humaneval --num-samples 50
# Option 2: merge into a separate directory beforehand
python3 -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, os
BASE = '/workspace/models/Qwen3-8B'
ADAPTER = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject/epoch_0_step_1000'
MERGED = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject-merged-epoch_0_step_1000'
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map='cpu')
model = PeftModel.from_pretrained(model, ADAPTER).merge_and_unload()
os.makedirs(MERGED, exist_ok=True)
model.save_pretrained(MERGED, safe_serialization=True)
AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)
"
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
--merged-path .../qwen3-8b-dflash-lora-inject-merged-epoch_0_step_1000 \
--benchmarks humaneval --num-samples 50
```
Available checkpoints: `epoch_0_step_500` / `epoch_0_step_1000` / `epoch_0_step_1400` / `epoch_2_step_34500` / `epoch_2_step_35000` / `epoch_3_step_1400`
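
To sweep all of these checkpoints, the per-checkpoint command can be generated in a loop. The sketch below only builds and prints the commands, using the script path and the `--ckpt`, `--benchmarks`, and `--num-samples` flags that appear in the examples above. The helper name `build_eval_command` is hypothetical.

```python
import shlex
from typing import List, Sequence

EVAL_SCRIPT = "/workspace/hanrui/syxin_old/eval_dflash_lora_inject.py"
CHECKPOINTS = [
    "epoch_0_step_500",
    "epoch_0_step_1000",
    "epoch_0_step_1400",
    "epoch_2_step_34500",
    "epoch_2_step_35000",
    "epoch_3_step_1400",
]


def build_eval_command(
    ckpt: str,
    benchmarks: Sequence[str] = ("humaneval",),
    num_samples: int = 50,
) -> List[str]:
    """Build the argv list for evaluating one checkpoint."""
    return [
        "python3", EVAL_SCRIPT,
        "--ckpt", ckpt,
        "--benchmarks", *benchmarks,
        "--num-samples", str(num_samples),
    ]


for ckpt in CHECKPOINTS:
    # Printing keeps this a dry run; nothing is executed here.
    print(shlex.join(build_eval_command(ckpt)))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)` inside the `spec` environment.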
---
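## FAQ
### Q1: accept_length is about the same as in STANDALONE mode (both ≈ 4.7)
This means layer injection is not actually taking effect. Check that:
- the evaluation script really is `eval_dflash_lora_inject.py` (offline), not the sglang bench
- the merged model really is the LoRA-Inject version (not the original Qwen3-8B)
### Q2: OOM (two 8B models do not fit on one GPU)
Two bf16 copies of Qwen3-8B take ≈ 32GB, so a single H100 80GB is enough. If you still hit OOM:
- Check whether other processes are using GPU memory
- Reduce `--max-new-tokens` (try 256)
- Reduce `--num-samples`
### Q3: Dataset download fails (no internet access)
The evaluation script reads local files first:
| bench | Local file |
|---|---|
| GSM8K | `/workspace/hanrui/datasets/gsm8k/test.jsonl` |
| MT-Bench | `/workspace/hanrui/datasets/mtbench/question.jsonl` |
| HumanEval | `/workspace/hanrui/datasets/humaneval/test.jsonl` |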
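
All three local files are JSON Lines, so they can be inspected with a few lines of standard-library code before a run. The loader below is a generic sketch: the function name and the `limit` parameter are hypothetical, and the field names inside each record depend on the dataset and are not assumed here. The demo writes a temporary file so that it runs anywhere.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def load_jsonl(path, limit: Optional[int] = None) -> List[dict]:
    """Read a .jsonl file into a list of dicts, skipping blank lines."""
    records: List[dict] = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if limit is not None and len(records) >= limit:
                break
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
    return records


# Demo on a temporary file; a real call would point at e.g.
# /workspace/hanrui/datasets/gsm8k/test.jsonl instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text('{"id": 1}\n\n{"id": 2}\n{"id": 3}\n', encoding="utf-8")
    demo_records = load_jsonl(demo_path, limit=2)
print(demo_records)
```

A `FileNotFoundError` from this loader on one of the paths in the table is a quick way to tell a missing local file apart from a failed download.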
---
*Base: `/workspace/models/Qwen3-8B` | Final ckpt: `epoch_3_step_1400` | block_size: 16*