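DFlash-LoRA-Inject Evaluation: Accepted Length & Accuracy
Why can't sglang be used for online evaluation?
DFlash-LoRA-Inject inference has to inject the target model's hidden states into the draft model layer by layer; this is the core mechanism used during LoRA-Inject training. sglang does not support this inference mode: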
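| sglang algorithm | Problem |
|---|---|
| STANDALONE | Runs the draft as a standalone autoregressive model and ignores layer injection entirely. The merged model ≈ the original Qwen3-8B, so accept_length stays ≈ 4.7 whether or not the LoRA was trained |
| DFLASH | Expects the DFlash-b16 architecture (5 layers + fc + hidden_norm), which does not match the LoRA-Inject structure (full 36-layer model) |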
Evaluation therefore has to be done offline: load both the target and the draft model, and implement the speculative decoding loop with layer injection by hand.
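To make the quantity being measured concrete, here is a minimal sketch of the acceptance step in one verify round. It assumes the common greedy scheme (temperature 0): draft tokens are kept up to the first disagreement with the target, and the target's own token at that position is always kept. The function name `accepted_length` is hypothetical and this is not code from eval_dflash_lora_inject.py; the layer-injection part of the loop is not shown.

```python
from typing import List


def accepted_length(draft_tokens: List[int], target_tokens: List[int]) -> int:
    """Tokens gained in one verify round under greedy matching (assumed scheme).

    draft_tokens:  the k tokens proposed by the draft for this block.
    target_tokens: the k + 1 greedy tokens the target predicts at the same
                   positions (one extra for the slot after the last draft token).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must supply one more token than the draft")
    matched = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        matched += 1
    # The target's token at the first mismatch (or after a full match) is
    # always kept, so every round yields at least one token.
    return matched + 1


print(accepted_length([1, 2, 3], [1, 2, 9, 4]))  # two matches + one target token
```

Under this scheme a draft that never matches still yields 1 token per round, which is why 1.0 is the floor for an ineffective draft.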
Basic information
| Item | Path / value |
|---|---|
| conda environment | spec |
| Base model (target) | /workspace/models/Qwen3-8B |
| Training output (final ckpt) | .../outputs/qwen3-8b-dflash-lora-inject/epoch_3_step_1400 |
| Merged draft model | .../outputs/qwen3-8b-dflash-lora-inject-merged |
| Evaluation script | /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py |
| Local datasets | /workspace/hanrui/datasets/{humaneval,mtbench,gsm8k} |
| Results output directory | /workspace/hanrui/syxin_old/Specforge/benchmarks/results/ |
| GPU | 8 × H100 80GB (one GPU is enough; loading two 8B models needs ~32GB) |
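The ~32GB figure follows from simple arithmetic on the weights. The sketch below assumes "8B" means exactly 8e9 parameters and bf16 storage at 2 bytes per parameter; the real parameter count differs slightly, and activations and the KV cache come on top of the weights.

```python
def weights_gb(num_params: float, bytes_per_param: int = 2, num_models: int = 1) -> float:
    """Memory taken by model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param * num_models / 1e9


NOMINAL_PARAMS = 8e9  # assumption: "8B" taken as exactly 8e9 parameters

print(weights_gb(NOMINAL_PARAMS, num_models=2))  # target + draft on one GPU -> 32.0
print(weights_gb(NOMINAL_PARAMS, num_models=1))  # a single bf16 copy -> 16.0
```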
Step 1: Merge the LoRA weights
LoRA-Inject training saves only the adapter weights, while evaluation needs a full model.
conda activate spec
python3 -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, os
BASE = '/workspace/models/Qwen3-8B'
ADAPTER = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject/epoch_3_step_1400'
MERGED = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject-merged'
if os.path.exists(MERGED):
    print(f'[skip] Merged model already exists: {MERGED}')
else:
    print('[1/4] Loading base model to CPU ...')
    model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map='cpu')
    print('[2/4] Loading LoRA adapter ...')
    model = PeftModel.from_pretrained(model, ADAPTER)
    print('[3/4] Merging weights ...')
    model = model.merge_and_unload()
    print('[4/4] Saving merged model ...')
    os.makedirs(MERGED, exist_ok=True)
    model.save_pretrained(MERGED, safe_serialization=True)
    AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)
    print(f'Done. Merged model saved to: {MERGED}')
"
Takes about 3–5 minutes, with CPU memory usage ≈ 16 GB. The merge is skipped automatically if the merged directory already exists.
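The skip check in the merge command only tests whether the directory exists, so a run interrupted while saving would be skipped the next time. A stricter check could look like the sketch below. The helper name `merged_dir_looks_complete` is hypothetical; it assumes that `save_pretrained(..., safe_serialization=True)` leaves a `config.json` and at least one `.safetensors` file in the directory.

```python
import os


def merged_dir_looks_complete(path: str) -> bool:
    """Heuristic: the directory holds a config and at least one weight shard."""
    if not os.path.isdir(path):
        return False
    names = os.listdir(path)
    has_config = "config.json" in names
    has_weights = any(name.endswith(".safetensors") for name in names)
    return has_config and has_weights


MERGED = "/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject-merged"
print(merged_dir_looks_complete(MERGED))
```

If this prints False for a directory that exists, deleting the directory and rerunning the merge is the safe option.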
Step 2: Offline evaluation of accepted length
No sglang server needs to be started; run the script directly:
All benchmarks (recommended)
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh
Run individually / quick test
# HumanEval only
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh humaneval
# Quick test (20 samples per bench)
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh --quick
# Specify a checkpoint
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh --ckpt epoch_0_step_1000
# Combined
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh humaneval gsm8k --quick
Or call Python directly
conda activate spec
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
--benchmarks humaneval mtbench gsm8k \
--block-size 16 \
--max-new-tokens 512 \
--temperature 0.0
Result file description
Results are saved under results/. Example file name:
dflash_lora_inject_offline_epoch_3_step_1400_20260314_150000.json
{
"model": "dflash-lora-inject/epoch_3_step_1400",
"block_size": 16,
"humaneval": {
"avg_accept_length": 3.42,
"total_tokens": 28500,
"latency": 120.5,
"throughput": 236.5,
"num_samples": 164,
"num_verify_rounds": 8320
},
"mtbench": { ... },
"gsm8k": { ... }
}
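| Field | Meaning |
|---|---|
| avg_accept_length | Core metric: average number of tokens accepted per verify round (with injection). Higher is better; 1.0 = the draft is completely ineffective |
| total_tokens | Total number of generated tokens |
| throughput | tokens/s (offline evaluation, without batching optimizations) |
| num_verify_rounds | Total number of verify rounds |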
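The fields in a result entry can be cross-checked against each other. The sketch below recomputes two ratios from the sample HumanEval entry shown above. It assumes that `latency` is in seconds and that the script derives its averages from these totals; neither is stated here, but the sample values agree with both ratios to within rounding.

```python
sample = {
    "avg_accept_length": 3.42,
    "total_tokens": 28500,
    "latency": 120.5,
    "throughput": 236.5,
    "num_samples": 164,
    "num_verify_rounds": 8320,
}


def derived_metrics(bench: dict) -> dict:
    """Recompute per-round and per-second token rates from the raw totals."""
    return {
        "tokens_per_round": bench["total_tokens"] / bench["num_verify_rounds"],
        "tokens_per_second": bench["total_tokens"] / bench["latency"],
    }


m = derived_metrics(sample)
print(round(m["tokens_per_round"], 3))   # 3.425, close to avg_accept_length
print(round(m["tokens_per_second"], 1))  # 236.5, matches throughput
```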
Comparing against the baseline
Measure the accept_length obtained when the original Qwen3-8B, without LoRA training, serves as the draft:
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
--merged-path /workspace/models/Qwen3-8B \
--benchmarks humaneval mtbench gsm8k \
--num-samples 50
This uses the original Qwen3-8B as both target and draft (with injection), so you can check whether accept_length improves after LoRA training.
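Once both runs have produced a result file, the comparison can be scripted. The sketch below assumes both files follow the JSON layout shown in the result file description; the helper name and the file names in the usage comment are placeholders.

```python
import json
from typing import Dict


def accept_length_delta(baseline: Dict, trained: Dict) -> Dict[str, float]:
    """Per-benchmark difference in avg_accept_length (trained minus baseline)."""
    deltas = {}
    for name, entry in trained.items():
        # Skip top-level scalar fields such as "model" and "block_size".
        if not isinstance(entry, dict) or "avg_accept_length" not in entry:
            continue
        base_entry = baseline.get(name)
        if isinstance(base_entry, dict) and "avg_accept_length" in base_entry:
            deltas[name] = entry["avg_accept_length"] - base_entry["avg_accept_length"]
    return deltas


# Usage (file names are placeholders):
# with open("baseline.json") as f_base, open("trained.json") as f_trained:
#     print(accept_length_delta(json.load(f_base), json.load(f_trained)))
```

A positive delta on a benchmark means the LoRA-trained draft gets more tokens accepted per verify round than the untrained one.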
How to test other checkpoints
# Method 1: load the adapter directly (merged automatically, not saved)
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
--ckpt epoch_0_step_1000 \
--benchmarks humaneval --num-samples 50
# Method 2: merge in advance into a separate directory
python3 -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, os
BASE = '/workspace/models/Qwen3-8B'
ADAPTER = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject/epoch_0_step_1000'
MERGED = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject-merged-epoch_0_step_1000'
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map='cpu')
model = PeftModel.from_pretrained(model, ADAPTER).merge_and_unload()
os.makedirs(MERGED, exist_ok=True)
model.save_pretrained(MERGED, safe_serialization=True)
AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)
"
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
--merged-path .../qwen3-8b-dflash-lora-inject-merged-epoch_0_step_1000 \
--benchmarks humaneval --num-samples 50
Available checkpoints: epoch_0_step_500 / epoch_0_step_1000 / epoch_0_step_1400 / epoch_2_step_34500 / epoch_2_step_35000 / epoch_3_step_1400
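To sweep all of these checkpoints with Method 1, the command lines can be generated rather than typed by hand. This sketch only builds and prints the commands, using the `--ckpt`, `--benchmarks` and `--num-samples` flags shown above; the helper name `build_commands` is hypothetical.

```python
import shlex

SCRIPT = "/workspace/hanrui/syxin_old/eval_dflash_lora_inject.py"
CHECKPOINTS = [
    "epoch_0_step_500",
    "epoch_0_step_1000",
    "epoch_0_step_1400",
    "epoch_2_step_34500",
    "epoch_2_step_35000",
    "epoch_3_step_1400",
]


def build_commands(checkpoints, benchmarks=("humaneval",), num_samples=50):
    """Return one argv list per checkpoint, ready for subprocess.run."""
    return [
        ["python3", SCRIPT, "--ckpt", ckpt,
         "--benchmarks", *benchmarks,
         "--num-samples", str(num_samples)]
        for ckpt in checkpoints
    ]


for cmd in build_commands(CHECKPOINTS):
    print(shlex.join(cmd))
```

To actually run them, pass each list to `subprocess.run(cmd, check=True)` inside the `spec` environment.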
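FAQ
Q1: accept_length is about the same as in STANDALONE mode (both ≈ 4.7)
This means layer injection is not actually taking effect. Check that:
- The evaluation script in use really is eval_dflash_lora_inject.py (offline), not sglang bench
- The merged model really is the LoRA-Inject version (not the original Qwen3-8B)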
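Q2: OOM (two 8B models do not fit on one GPU)
Two bf16 Qwen3-8B models take ≈ 32GB, so a single 80GB H100 is enough. If you still hit OOM:
- Check whether other processes are occupying GPU memory
- Reduce --max-new-tokens (try 256)
- Reduce --num-samples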
Q3: Dataset download fails (no external network access)
The evaluation script reads local files first:
| bench | Local file |
|---|---|
| GSM8K | /workspace/hanrui/datasets/gsm8k/test.jsonl |
| MT-Bench | /workspace/hanrui/datasets/mtbench/question.jsonl |
| HumanEval | /workspace/hanrui/datasets/humaneval/test.jsonl |
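Before starting a run on a machine without network access, it may help to confirm that all three local files are in place. The sketch below checks the paths from the table above; the helper name `missing_datasets` is hypothetical and is not part of the evaluation script.

```python
import os

DATASET_FILES = {
    "gsm8k": "/workspace/hanrui/datasets/gsm8k/test.jsonl",
    "mtbench": "/workspace/hanrui/datasets/mtbench/question.jsonl",
    "humaneval": "/workspace/hanrui/datasets/humaneval/test.jsonl",
}


def missing_datasets(files=DATASET_FILES):
    """Return the names of benchmarks whose local file does not exist."""
    return [name for name, path in files.items() if not os.path.isfile(path)]


missing = missing_datasets()
print("missing:", missing if missing else "none")
```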
Base: /workspace/models/Qwen3-8B | Final ckpt: epoch_3_step_1400 | block_size: 16