# DFlash-LoRA-Inject Evaluation: Accepted Length & Accuracy

## Why can't sglang be used for online evaluation?

DFlash-LoRA-Inject inference requires **injecting the target model's hidden states into the draft model layer by layer**; this is the core mechanism used during LoRA-Inject training. sglang does not support this inference mode:

| sglang algorithm | Problem |
|---|---|
| `STANDALONE` | Runs the draft as an independent autoregressive model and **ignores layer injection entirely**. The merged model ≈ the original Qwen3-8B, so accept_length stays at ≈ 4.7 whether or not LoRA was trained |
| `DFLASH` | Expects the DFlash-b16 architecture (5 layers + fc + hidden_norm), which does not match the LoRA-Inject structure (full 36-layer model) |

Evaluation must therefore be done **offline**: load both the target and the draft model, and manually implement a speculative decoding loop with layer injection.

---

## Basic information

| Item | Path / value |
|---|---|
| conda environment | `spec` |
| Base model (target) | `/workspace/models/Qwen3-8B` |
| Training output (final ckpt) | `.../outputs/qwen3-8b-dflash-lora-inject/epoch_3_step_1400` |
| Merged draft model | `.../outputs/qwen3-8b-dflash-lora-inject-merged` |
| Evaluation script | `/workspace/hanrui/syxin_old/eval_dflash_lora_inject.py` |
| Local datasets | `/workspace/hanrui/datasets/{humaneval,mtbench,gsm8k}` |
| Result output directory | `/workspace/hanrui/syxin_old/Specforge/benchmarks/results/` |
| GPU | 8 × H100 80GB (a single GPU is enough; loading two 8B models takes ~32GB) |

---

## Step 1: Merge the LoRA weights

LoRA-Inject training saves only the adapter weights; evaluation needs the full model.

```bash
conda activate spec

python3 -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, os

BASE    = '/workspace/models/Qwen3-8B'
ADAPTER = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject/epoch_3_step_1400'
MERGED  = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject-merged'

if os.path.exists(MERGED):
    print(f'[skip] Merged model already exists: {MERGED}')
else:
    print('[1/4] Loading base model to CPU ...')
    model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map='cpu')
    print('[2/4] Loading LoRA adapter ...')
    model = PeftModel.from_pretrained(model, ADAPTER)
    print('[3/4] Merging weights ...')
    model = model.merge_and_unload()
    print('[4/4] Saving merged model ...')
    os.makedirs(MERGED, exist_ok=True)
    model.save_pretrained(MERGED, safe_serialization=True)
    AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)
    print(f'Done. Merged model saved to: {MERGED}')
"
```

> Takes about 3–5 minutes and uses ≈ 16 GB of CPU memory. The merge is skipped automatically if the merged model already exists.

---

## Step 2: Evaluate accepted length offline

**There is no need to start an sglang server**; run the benchmark directly:

### All benchmarks (recommended)

```bash
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh
```

### Run individually / quick test

```bash
# HumanEval only
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh humaneval

# Quick test (20 samples per bench)
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh --quick

# Use a specific checkpoint
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh --ckpt epoch_0_step_1000

# Combined
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh humaneval gsm8k --quick
```

### Or call Python directly

```bash
conda activate spec

python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
    --benchmarks humaneval mtbench gsm8k \
    --block-size 16 \
    --max-new-tokens 512 \
    --temperature 0.0
```

---

## Result file format

Results are saved under `results/`. Example file name:

```
dflash_lora_inject_offline_epoch_3_step_1400_20260314_150000.json
```

```json
{
  "model": "dflash-lora-inject/epoch_3_step_1400",
  "block_size": 16,
  "humaneval": {
    "avg_accept_length": 3.42,
    "total_tokens": 28500,
    "latency": 120.5,
    "throughput": 236.5,
    "num_samples": 164,
    "num_verify_rounds": 8320
  },
  "mtbench": { ... },
  "gsm8k": { ...
  }
}
```

| Field | Meaning |
|---|---|
| `avg_accept_length` | **Core metric**: average number of tokens accepted per verify round (measured with injection). Higher is better; `1.0` means the draft contributes nothing |
| `total_tokens` | Total number of generated tokens |
| `throughput` | tokens/s (offline evaluation, without batching optimizations) |
| `num_verify_rounds` | Total number of verify rounds |

---

## Comparing against the baseline

To measure accept_length with the original Qwen3-8B (no LoRA training) as the draft:

```bash
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
    --merged-path /workspace/models/Qwen3-8B \
    --benchmarks humaneval mtbench gsm8k \
    --num-samples 50
```

> This uses the original Qwen3-8B as both target and draft (with injection),
> so you can check whether accept_length improves after LoRA training.

---

## How to test other checkpoints

```bash
# Method 1: load the adapter directly (merged on the fly, not saved)
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
    --ckpt epoch_0_step_1000 \
    --benchmarks humaneval --num-samples 50

# Method 2: merge in advance into a separate directory
python3 -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, os

BASE    = '/workspace/models/Qwen3-8B'
ADAPTER = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject/epoch_0_step_1000'
MERGED  = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject-merged-epoch_0_step_1000'

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map='cpu')
model = PeftModel.from_pretrained(model, ADAPTER).merge_and_unload()
os.makedirs(MERGED, exist_ok=True)
model.save_pretrained(MERGED, safe_serialization=True)
AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)
"

python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
    --merged-path .../qwen3-8b-dflash-lora-inject-merged-epoch_0_step_1000 \
    --benchmarks humaneval --num-samples 50
```

Available checkpoints: `epoch_0_step_500` / `epoch_0_step_1000` / `epoch_0_step_1400` / `epoch_2_step_34500` / `epoch_2_step_35000` / `epoch_3_step_1400`

---

## FAQ

### Q1: accept_length is about the same as in STANDALONE mode (both ≈ 4.7)

This means layer injection is not actually taking effect. Check that:

- The evaluation really uses `eval_dflash_lora_inject.py` (offline), not the sglang bench
- The merged model really is the LoRA-Inject version (not the original
  Qwen3-8B)

### Q2: OOM (two 8B models do not fit on one GPU)

Two bf16 Qwen3-8B models take ≈ 32GB, so a single 80GB H100 is enough. If you still hit OOM:

- Check whether other processes are using GPU memory
- Reduce `--max-new-tokens` (try 256)
- Reduce `--num-samples`

### Q3: Dataset download fails (no internet access)

The evaluation script reads local files first:

| bench | Local file |
|---|---|
| GSM8K | `/workspace/hanrui/datasets/gsm8k/test.jsonl` |
| MT-Bench | `/workspace/hanrui/datasets/mtbench/question.jsonl` |
| HumanEval | `/workspace/hanrui/datasets/humaneval/test.jsonl` |

---

*Base: `/workspace/models/Qwen3-8B` | Final ckpt: `epoch_3_step_1400` | block_size: 16*
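
To make the `avg_accept_length` metric from the result-file section more concrete, here is a minimal sketch of how the accepted length of a single verify round is commonly counted in greedy speculative decoding (matching `--temperature 0.0`). This is an assumption about the counting convention, not code taken from `eval_dflash_lora_inject.py`; the function names `accepted_in_round` and `average_accept_length` are hypothetical, and layer injection is left out entirely.

```python
from typing import Sequence


def accepted_in_round(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Count the tokens emitted by one verify round under greedy verification.

    target_tokens[i] is the target's greedy token at position i, computed with
    draft_tokens[:i] as context, so it holds len(draft_tokens) + 1 entries.
    (Assumed convention; the real script may count differently.)
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    matched = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        matched += 1
    # The target's own token at the first mismatch (or the bonus token after a
    # full match) is always kept, so a round never yields fewer than 1 token.
    return matched + 1


def average_accept_length(rounds: Sequence[int]) -> float:
    """Mean number of tokens emitted per verify round."""
    return sum(rounds) / len(rounds)


if __name__ == "__main__":
    rounds = [
        accepted_in_round([5, 7, 9, 11], [5, 7, 2, 8, 4]),   # 2 matches + 1
        accepted_in_round([5, 7, 9, 11], [1, 7, 9, 11, 4]),  # 0 matches + 1
        accepted_in_round([5, 7, 9, 11], [5, 7, 9, 11, 4]),  # 4 matches + 1
    ]
    print(rounds, average_accept_length(rounds))  # [3, 1, 5] 3.0
```

Under this convention a draft that never matches yields exactly 1 token per round, which agrees with the note that `1.0` means the draft contributes nothing. In the sample JSON, `total_tokens / num_verify_rounds` = 28500 / 8320 ≈ 3.43, close to the reported `avg_accept_length` of 3.42; that is consistent with this reading, though the script may average differently (for example per sample).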