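# DFlash-LoRA-Inject Evaluation: Accepted Length & Accuracy

## Why can't sglang be used for online evaluation?

DFlash-LoRA-Inject inference must **inject the target model's hidden states into the draft model layer by layer**;
this is the core mechanism used during LoRA-Inject training. sglang does not support this inference mode:

| sglang algorithm | Problem |
|---|---|
| `STANDALONE` | Runs the draft as an independent autoregressive model and **ignores layer injection entirely**. The merged model then behaves like the original Qwen3-8B, so accept_length stays at ≈ 4.7 regardless of whether the LoRA was trained |
| `DFLASH` | Expects the DFlash-b16 architecture (5 layers + fc + hidden_norm), which does not match the LoRA-Inject structure (the full 36-layer model) |

Evaluation must therefore be done **offline**: load both the target and the draft model, and manually implement the speculative decoding loop with layer injection.
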
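
As a rough illustration of what one verify round in such a loop measures, the sketch below counts accepted tokens under greedy (temperature 0) verification: the longest prefix of the draft block that matches the target's own choices, plus the one token the target always contributes. This is the standard greedy acceptance rule, not code taken from the evaluation script; it leaves out the model calls and the layer-injection step, and `accepted_length`, `draft_tokens`, and `target_tokens` are hypothetical names. Under this rule a draft that never matches still yields 1 token per round.

```python
from typing import Sequence


def accepted_length(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Tokens gained in one greedy verify round (hypothetical helper).

    draft_tokens:  the k tokens proposed by the draft model.
    target_tokens: the k + 1 tokens the target model picks at the same
                   positions (one extra for the position after the block).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    matched = 0
    for draft_tok, target_tok in zip(draft_tokens, target_tokens):
        if draft_tok != target_tok:
            break  # first disagreement: everything after it is discarded
        matched += 1

    # The target's own token at the first mismatch (or after a full match)
    # is always kept, so every round yields at least one token.
    return matched + 1


def average_accept_length(per_round: Sequence[int]) -> float:
    """Mean accepted length over all verify rounds."""
    return sum(per_round) / len(per_round)


# Two draft tokens match, the third does not.
print(accepted_length([5, 9, 2], [5, 9, 7, 1]))
```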

---

## Basic information

| Item | Path / value |
|---|---|
| conda environment | `spec` |
| Base model (target) | `/workspace/models/Qwen3-8B` |
| Training output (final ckpt) | `.../outputs/qwen3-8b-dflash-lora-inject/epoch_3_step_1400` |
| Merged draft model | `.../outputs/qwen3-8b-dflash-lora-inject-merged` |
| Evaluation script | `/workspace/hanrui/syxin_old/eval_dflash_lora_inject.py` |
| Local datasets | `/workspace/hanrui/datasets/{humaneval,mtbench,gsm8k}` |
| Results output directory | `/workspace/hanrui/syxin_old/Specforge/benchmarks/results/` |
| GPU | 8 × H100 80GB (a single GPU is enough; loading the two 8B models takes ~32GB) |

---

## Step 1: Merge the LoRA weights

LoRA-Inject training saves only the adapter weights, but evaluation needs a full model.

```bash
conda activate spec

python3 -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, os

BASE    = '/workspace/models/Qwen3-8B'
ADAPTER = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject/epoch_3_step_1400'
MERGED  = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject-merged'

if os.path.exists(MERGED):
    # Re-running is safe: an existing merged directory is left untouched.
    print(f'[skip] Merged model already exists: {MERGED}')
else:
    # Load on CPU so the merge does not need GPU memory.
    print('[1/4] Loading base model to CPU ...')
    model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map='cpu')
    print('[2/4] Loading LoRA adapter ...')
    model = PeftModel.from_pretrained(model, ADAPTER)
    # Fold the LoRA deltas into the base weights and drop the adapter wrappers.
    print('[3/4] Merging weights ...')
    model = model.merge_and_unload()
    print('[4/4] Saving merged model ...')
    os.makedirs(MERGED, exist_ok=True)
    model.save_pretrained(MERGED, safe_serialization=True)
    # Copy the tokenizer so the merged directory is self-contained.
    AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)
    print(f'Done. Merged model saved to: {MERGED}')
"
```

> Takes about 3–5 minutes and uses ≈ 16 GB of CPU memory. If the merged model already exists, the step is skipped automatically.

---

## Step 2: Offline evaluation of accepted length

**No sglang server needs to be started**; run the script directly:

### All benchmarks (recommended)

```bash
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh
```

### Run a subset / quick test

```bash
# HumanEval only
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh humaneval

# Quick test (20 samples per benchmark)
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh --quick

# Specify a checkpoint
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh --ckpt epoch_0_step_1000

# Combined
bash /workspace/hanrui/syxin_old/run_bench_dflash.sh humaneval gsm8k --quick
```

### Or call Python directly

```bash
conda activate spec

python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
    --benchmarks humaneval mtbench gsm8k \
    --block-size 16 \
    --max-new-tokens 512 \
    --temperature 0.0
```

---

## Result files

Results are saved under `results/`. Example file name:
```
dflash_lora_inject_offline_epoch_3_step_1400_20260314_150000.json
```

```json
{
  "model": "dflash-lora-inject/epoch_3_step_1400",
  "block_size": 16,
  "humaneval": {
    "avg_accept_length": 3.42,
    "total_tokens": 28500,
    "latency": 120.5,
    "throughput": 236.5,
    "num_samples": 164,
    "num_verify_rounds": 8320
  },
  "mtbench": { ... },
  "gsm8k": { ... }
}
```

| Field | Meaning |
|---|---|
| `avg_accept_length` | **Key metric**: average number of tokens accepted per verify round (with injection). Higher is better; `1.0` means the draft contributes nothing |
| `total_tokens` | Total number of generated tokens |
| `throughput` | tokens/s (offline evaluation, without batching optimizations) |
| `num_verify_rounds` | Total number of verify rounds |

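
The sample values above are roughly consistent with two simple ratios: `total_tokens / num_verify_rounds` for the accepted length and `total_tokens / latency` for throughput, taking `latency` to be in seconds. The sketch below recomputes both from one benchmark entry as a sanity check. These formulas are an assumption inferred from the sample numbers, not taken from the evaluation script, which may for example average per sample instead; `derive_metrics` is a hypothetical helper.

```python
import json

# The "humaneval" entry copied from the sample result file above.
SAMPLE = json.loads("""
{
  "humaneval": {
    "avg_accept_length": 3.42,
    "total_tokens": 28500,
    "latency": 120.5,
    "throughput": 236.5,
    "num_samples": 164,
    "num_verify_rounds": 8320
  }
}
""")


def derive_metrics(entry: dict) -> dict:
    """Recompute the two ratio metrics from the raw counters (assumed formulas)."""
    return {
        "accept_length": entry["total_tokens"] / entry["num_verify_rounds"],
        "throughput": entry["total_tokens"] / entry["latency"],
    }


entry = SAMPLE["humaneval"]
derived = derive_metrics(entry)
print(f"accept_length: reported {entry['avg_accept_length']}, derived {derived['accept_length']:.3f}")
print(f"throughput:    reported {entry['throughput']}, derived {derived['throughput']:.1f}")
```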

---

## Comparing against a baseline

Compare against the accept_length obtained when the original Qwen3-8B, without LoRA training, is used as the draft:

```bash
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
    --merged-path /workspace/models/Qwen3-8B \
    --benchmarks humaneval mtbench gsm8k \
    --num-samples 50
```

> This uses the original Qwen3-8B as both target and draft (with injection),
> which shows whether accept_length improves after LoRA training.

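
Once both runs have produced result files, the per-benchmark difference can be read off with a few lines of Python. This is a minimal sketch that assumes both files follow the JSON layout shown in the result files section; `load_results` and `accept_length_delta` are hypothetical helpers, and the paths passed to them are whatever file names the two runs actually wrote. For a fair comparison both runs should use the same `--num-samples`.

```python
import json
from typing import Dict


def load_results(path: str) -> dict:
    """Read one result file written by the evaluation script."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def accept_length_delta(baseline: dict, trained: dict) -> Dict[str, float]:
    """Return trained minus baseline avg_accept_length per shared benchmark.

    Top-level keys that are not benchmark entries (for example "model" or
    "block_size") are skipped because their values are not dicts.
    """
    deltas = {}
    for name, entry in trained.items():
        base_entry = baseline.get(name)
        if not isinstance(entry, dict) or not isinstance(base_entry, dict):
            continue
        deltas[name] = entry["avg_accept_length"] - base_entry["avg_accept_length"]
    return deltas


def format_report(deltas: Dict[str, float]) -> str:
    """One line per benchmark; a positive delta means the LoRA helped."""
    return "\n".join(f"{name}: {delta:+.2f}" for name, delta in sorted(deltas.items()))
```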

---

## Evaluating other checkpoints

```bash
# Option 1: load the adapter directly (merged automatically, not saved)
python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
    --ckpt epoch_0_step_1000 \
    --benchmarks humaneval --num-samples 50

# Option 2: merge into a separate directory first
python3 -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, os
BASE    = '/workspace/models/Qwen3-8B'
ADAPTER = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject/epoch_0_step_1000'
MERGED  = '/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-dflash-lora-inject-merged-epoch_0_step_1000'
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map='cpu')
model = PeftModel.from_pretrained(model, ADAPTER).merge_and_unload()
os.makedirs(MERGED, exist_ok=True)
model.save_pretrained(MERGED, safe_serialization=True)
AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)
"

python3 /workspace/hanrui/syxin_old/eval_dflash_lora_inject.py \
    --merged-path .../qwen3-8b-dflash-lora-inject-merged-epoch_0_step_1000 \
    --benchmarks humaneval --num-samples 50
```

Available checkpoints: `epoch_0_step_500` / `epoch_0_step_1000` / `epoch_0_step_1400` / `epoch_2_step_34500` / `epoch_2_step_35000` / `epoch_3_step_1400`

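
To sweep every listed checkpoint with Option 1, the command line can be generated rather than typed six times. The sketch below only builds and prints the commands (a dry run); it uses the `--ckpt`, `--benchmarks`, and `--num-samples` flags exactly as they appear above, and `build_commands` is a hypothetical helper. To actually launch the runs, each list could be passed to `subprocess.run(cmd, check=True)`.

```python
import shlex
from typing import List, Sequence

EVAL_SCRIPT = "/workspace/hanrui/syxin_old/eval_dflash_lora_inject.py"

CHECKPOINTS = [
    "epoch_0_step_500",
    "epoch_0_step_1000",
    "epoch_0_step_1400",
    "epoch_2_step_34500",
    "epoch_2_step_35000",
    "epoch_3_step_1400",
]


def build_commands(
    checkpoints: Sequence[str],
    benchmarks: Sequence[str] = ("humaneval",),
    num_samples: int = 50,
) -> List[List[str]]:
    """Build one argv list per checkpoint, mirroring the Option 1 command."""
    commands = []
    for ckpt in checkpoints:
        commands.append([
            "python3", EVAL_SCRIPT,
            "--ckpt", ckpt,
            "--benchmarks", *benchmarks,
            "--num-samples", str(num_samples),
        ])
    return commands


# Dry run: print shell-quoted commands instead of executing them.
for cmd in build_commands(CHECKPOINTS):
    print(shlex.join(cmd))
```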

---

## FAQ

### Q1: accept_length is about the same as in STANDALONE mode (both ≈ 4.7)

This means layer injection is not actually taking effect. Check that:
- the evaluation script really is `eval_dflash_lora_inject.py` (offline), not the sglang bench
- the merged model really is the LoRA-Inject version (not the original Qwen3-8B)

### Q2: OOM (two 8B models do not fit on one GPU)

Two Qwen3-8B models in bf16 take ≈ 32GB, so a single H100 80GB is enough. If you still hit OOM:
- Check whether other processes are using GPU memory
- Reduce `--max-new-tokens` (try 256)
- Reduce `--num-samples`

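
The ≈ 32GB figure follows from parameter count times bytes per parameter. A back-of-the-envelope sketch, assuming roughly 8.2 billion parameters per model (an approximation for Qwen3-8B) and 2 bytes per bf16 weight; `weights_gb` is a hypothetical helper. Activations and the KV cache come on top of the weights, which is why `--max-new-tokens` affects peak memory.

```python
def weights_gb(num_params: float, bytes_per_param: int = 2, num_models: int = 1) -> float:
    """Memory needed for model weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param * num_models / 1e9


# Target + draft, both ~8.2B parameters (assumed) in bf16.
both = weights_gb(8.2e9, bytes_per_param=2, num_models=2)
print(f"weights for two models: {both:.1f} GB")
print(f"left on an 80 GB card:  {80 - both:.1f} GB for activations and KV cache")
```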

### Q3: Dataset download fails (no external network access)

The evaluation script reads local files first:

| Benchmark | Local file |
|---|---|
| GSM8K | `/workspace/hanrui/datasets/gsm8k/test.jsonl` |
| MT-Bench | `/workspace/hanrui/datasets/mtbench/question.jsonl` |
| HumanEval | `/workspace/hanrui/datasets/humaneval/test.jsonl` |

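
Before an offline run, it can help to confirm that the files in the table are present and parseable. A minimal sketch, assuming each file is JSON Lines (one JSON object per line, as the `.jsonl` extension suggests); `count_records` and `check_datasets` are hypothetical helpers and not part of the evaluation script.

```python
import json
from pathlib import Path
from typing import Dict, Optional

LOCAL_FILES = {
    "gsm8k": "/workspace/hanrui/datasets/gsm8k/test.jsonl",
    "mtbench": "/workspace/hanrui/datasets/mtbench/question.jsonl",
    "humaneval": "/workspace/hanrui/datasets/humaneval/test.jsonl",
}


def count_records(path: str) -> int:
    """Count non-empty lines, parsing each one so malformed JSON fails loudly."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)
                count += 1
    return count


def check_datasets(files: Dict[str, str]) -> Dict[str, Optional[int]]:
    """Map each benchmark to its record count, or None if the file is missing."""
    report = {}
    for name, path in files.items():
        report[name] = count_records(path) if Path(path).is_file() else None
    return report


for name, n in check_datasets(LOCAL_FILES).items():
    print(f"{name}: {'missing' if n is None else f'{n} records'}")
```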

---

*Base model: `/workspace/models/Qwen3-8B` | Final ckpt: `epoch_3_step_1400` | block_size: 16*