# Accept Length Testing Guide

## 0. Preparation

### Create directories
```bash
cd /workspace/hanrui/SpecForge-ext
mkdir -p logs results
```

### Download the datasets (first run only)
```bash
cd /workspace/hanrui/SpecForge-ext
python download_datasets.py
```

The datasets are saved to:
- MT-Bench: `/workspace/hanrui/datasets/mtbench/question.jsonl`
- GSM8K: `/workspace/hanrui/datasets/gsm8k/test.jsonl`
- HumanEval: `/workspace/hanrui/datasets/humaneval/test.jsonl`

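Before launching the server, it can help to confirm that the download produced the three files listed above. The following is a minimal sketch, assuming each file is JSON Lines with one record per non-empty line (the `.jsonl` extension suggests this, but the download script is not shown here). `check_datasets` is a hypothetical helper name, not part of SpecForge.

```python
from pathlib import Path

# Paths taken from the list above.
DATASETS = {
    "MT-Bench": "/workspace/hanrui/datasets/mtbench/question.jsonl",
    "GSM8K": "/workspace/hanrui/datasets/gsm8k/test.jsonl",
    "HumanEval": "/workspace/hanrui/datasets/humaneval/test.jsonl",
}


def check_datasets(paths):
    """Return {name: record_count}, with None for any file that is missing."""
    report = {}
    for name, path in paths.items():
        p = Path(path)
        if not p.is_file():
            report[name] = None
            continue
        with p.open("r", encoding="utf-8") as f:
            # Assumption: one JSON record per non-empty line.
            report[name] = sum(1 for line in f if line.strip())
    return report


if __name__ == "__main__":
    for name, count in check_datasets(DATASETS).items():
        status = "MISSING" if count is None else f"{count} records"
        print(f"{name}: {status}")
```
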

---

## 1. Test the Baseline Model

### Launch the server (terminal 1)
```bash
cd /workspace/hanrui/SpecForge-ext

# Set environment variables (bypass the proxy for local and private-network addresses)
export NO_PROXY="localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"
export no_proxy="localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"

# Launch the baseline server
python3 -m sglang.launch_server \
    --model /workspace/Qwen3-8B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path /workspace/qwen3_8b_eagle3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.75 \
    --cuda-graph-max-bs 1 \
    --tp 1 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16 \
    --skip-server-warmup
```

Wait until the server log shows `Application startup complete`, then continue to the next step.

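Instead of watching the log by hand, the wait can be scripted. The following is a minimal sketch, assuming the SGLang server answers `GET /health` with HTTP 200 once it is ready (verify this against your SGLang version). `http_probe`, `wait_until_ready`, and the `probe` parameter are hypothetical names introduced here for illustration.

```python
import time
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if `url` answers with HTTP 200, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # urllib.error.URLError and HTTPError are both subclasses of OSError.
        return False


def wait_until_ready(url, probe=http_probe, interval=2.0, max_wait=600.0, sleep=time.sleep):
    """Poll `probe(url)` until it succeeds; return False once `max_wait` seconds of sleeping have passed."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

From terminal 2, a call such as `wait_until_ready("http://127.0.0.1:30000/health")` would then block until the server on port 30000 responds or ten minutes have passed. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.
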
### Run the three benchmarks (terminal 2)
```bash
cd /workspace/hanrui/SpecForge-ext
conda activate /workspace/Hanrui/

# Set environment variables (bypass the proxy for local and private-network addresses)
export NO_PROXY="localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"
export no_proxy="localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"

# 1. MT-Bench
echo "=== Running MT-Bench (Baseline) ==="
python benchmarks/bench_eagle3.py \
    --model-path /workspace/Qwen3-8B \
    --host 10.1.1.31 \
    --port 30000 \
    --config-list 1,3,1,4 \
    --benchmark-list mtbench:80 \
    --dtype bfloat16 \
    --skip-launch-server \
    --name baseline_mtbench \
    --output-dir ./results \
    2>&1 | tee logs/baseline_mtbench_$(date +%Y%m%d_%H%M%S).log

# 2. GSM8K
echo "=== Running GSM8K (Baseline) ==="
python benchmarks/bench_eagle3.py \
    --model-path /workspace/Qwen3-8B \
    --host 10.1.1.31 \
    --port 30000 \
    --config-list 1,3,1,4 \
    --benchmark-list gsm8k:100 \
    --dtype bfloat16 \
    --skip-launch-server \
    --name baseline_gsm8k \
    --output-dir ./results \
    2>&1 | tee logs/baseline_gsm8k_$(date +%Y%m%d_%H%M%S).log

# 3. HumanEval
echo "=== Running HumanEval (Baseline) ==="
python benchmarks/bench_eagle3.py \
    --model-path /workspace/Qwen3-8B \
    --host 10.1.1.31 \
    --port 30000 \
    --config-list 1,3,1,4 \
    --benchmark-list humaneval:164 \
    --dtype bfloat16 \
    --skip-launch-server \
    --name baseline_humaneval \
    --output-dir ./results \
    2>&1 | tee logs/baseline_humaneval_$(date +%Y%m%d_%H%M%S).log

echo "=== Baseline tests complete ==="
```

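The three invocations above differ only in the benchmark spec and the run name, so they can be generated rather than copied. The sketch below only assembles the argument lists from the flags shown above; `BENCHMARKS`, `build_bench_cmd`, and `build_all` are hypothetical names, and actually launching the commands (for example with `subprocess.run`) is left to the caller.

```python
import shlex

# Benchmark name -> number of samples, as used in the commands above.
BENCHMARKS = {"mtbench": 80, "gsm8k": 100, "humaneval": 164}


def build_bench_cmd(prefix, benchmark, num_samples,
                    model_path="/workspace/Qwen3-8B",
                    host="10.1.1.31", port=30000):
    """Assemble the bench_eagle3.py argument list for one benchmark run."""
    return [
        "python", "benchmarks/bench_eagle3.py",
        "--model-path", model_path,
        "--host", host,
        "--port", str(port),
        "--config-list", "1,3,1,4",
        "--benchmark-list", f"{benchmark}:{num_samples}",
        "--dtype", "bfloat16",
        "--skip-launch-server",
        "--name", f"{prefix}_{benchmark}",
        "--output-dir", "./results",
    ]


def build_all(prefix):
    """Return one command per benchmark, in the order MT-Bench, GSM8K, HumanEval."""
    return [build_bench_cmd(prefix, name, n) for name, n in BENCHMARKS.items()]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for cmd in build_all("baseline"):
        print(shlex.join(cmd))
```

Calling `build_all("trained")` yields the same commands with `trained_*` run names, which matches the naming used for the trained model later in this guide.
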

---

## 2. Test the Trained Model

### Stop the baseline server and launch the trained server (terminal 1)
```bash
cd /workspace/hanrui/SpecForge-ext

# Stop the old server
pkill -f "sglang.launch_server"
sleep 5

# Set environment variables (bypass the proxy for local and private-network addresses)
export NO_PROXY="localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"
export no_proxy="localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"

# Launch the server with the trained draft model
python3 -m sglang.launch_server \
    --model /workspace/Qwen3-8B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path /workspace/hanrui/SpecForge-ext/outputs/qwen3-8b-qwen3eagle-5layer/epoch_9_step_12310 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.75 \
    --cuda-graph-max-bs 1 \
    --tp 1 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16 \
    --skip-server-warmup
```

Wait until the server log shows `Application startup complete`, then continue to the next step.

### Run the three benchmarks (terminal 2)
```bash
cd /workspace/hanrui/SpecForge-ext
conda activate /workspace/Hanrui/

# Set environment variables (bypass the proxy for local and private-network addresses)
export NO_PROXY="localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"
export no_proxy="localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"

# 1. MT-Bench
echo "=== Running MT-Bench (Trained) ==="
python benchmarks/bench_eagle3.py \
    --model-path /workspace/Qwen3-8B \
    --host 10.1.1.31 \
    --port 30000 \
    --config-list 1,3,1,4 \
    --benchmark-list mtbench:80 \
    --dtype bfloat16 \
    --skip-launch-server \
    --name trained_mtbench \
    --output-dir ./results \
    2>&1 | tee logs/trained_mtbench_$(date +%Y%m%d_%H%M%S).log

# 2. GSM8K
echo "=== Running GSM8K (Trained) ==="
python benchmarks/bench_eagle3.py \
    --model-path /workspace/Qwen3-8B \
    --host 10.1.1.31 \
    --port 30000 \
    --config-list 1,3,1,4 \
    --benchmark-list gsm8k:100 \
    --dtype bfloat16 \
    --skip-launch-server \
    --name trained_gsm8k \
    --output-dir ./results \
    2>&1 | tee logs/trained_gsm8k_$(date +%Y%m%d_%H%M%S).log

# 3. HumanEval
echo "=== Running HumanEval (Trained) ==="
python benchmarks/bench_eagle3.py \
    --model-path /workspace/Qwen3-8B \
    --host 10.1.1.31 \
    --port 30000 \
    --config-list 1,3,1,4 \
    --benchmark-list humaneval:164 \
    --dtype bfloat16 \
    --skip-launch-server \
    --name trained_humaneval \
    --output-dir ./results \
    2>&1 | tee logs/trained_humaneval_$(date +%Y%m%d_%H%M%S).log

echo "=== Trained tests complete ==="
```


---

## 3. View the Results

### Log and result file locations
All logs are saved in `/workspace/hanrui/SpecForge-ext/logs/`:
- `baseline_mtbench_*.log`
- `baseline_gsm8k_*.log`
- `baseline_humaneval_*.log`
- `trained_mtbench_*.log`
- `trained_gsm8k_*.log`
- `trained_humaneval_*.log`

All results are saved in `/workspace/hanrui/SpecForge-ext/results/`:
- `baseline_mtbench_*.jsonl`
- `baseline_gsm8k_*.jsonl`
- `baseline_humaneval_*.jsonl`
- `trained_mtbench_*.jsonl`
- `trained_gsm8k_*.jsonl`
- `trained_humaneval_*.jsonl`

### Generate a comparison report
```bash
cd /workspace/hanrui/SpecForge-ext

python3 << 'EOF'
import glob
import json

print("=" * 80)
print("Accept Length Comparison Report")
print("=" * 80)

datasets = ['mtbench', 'gsm8k', 'humaneval']

for dataset in datasets:
    print(f"\n{'=' * 80}")
    print(f"{dataset.upper()} result comparison")
    print('=' * 80)

    baseline_files = sorted(glob.glob(f'results/baseline_{dataset}_*.jsonl'))
    trained_files = sorted(glob.glob(f'results/trained_{dataset}_*.jsonl'))

    if not baseline_files or not trained_files:
        print(f"  No result files found for {dataset}")
        continue

    # Use the last file in sorted order on each side
    with open(baseline_files[-1], 'r') as f:
        baseline = json.load(f)

    with open(trained_files[-1], 'r') as f:
        trained = json.load(f)

    baseline_metrics = baseline[dataset][0]['metrics'][0]
    trained_metrics = trained[dataset][0]['metrics'][0]
    baseline_acc = baseline_metrics.get('accuracy')
    trained_acc = trained_metrics.get('accuracy')

    print("\nBaseline:")
    print(f"  Accept Length: {baseline_metrics['accept_length']:.4f}")
    print(f"  Output Throughput: {baseline_metrics['output_throughput']:.2f} tokens/s")
    if baseline_acc is not None:
        print(f"  Accuracy: {baseline_acc:.2%}")

    print("\nTrained:")
    print(f"  Accept Length: {trained_metrics['accept_length']:.4f}")
    print(f"  Output Throughput: {trained_metrics['output_throughput']:.2f} tokens/s")
    if trained_acc is not None:
        print(f"  Accuracy: {trained_acc:.2%}")

    accept_diff = trained_metrics['accept_length'] - baseline_metrics['accept_length']
    accept_pct = (accept_diff / baseline_metrics['accept_length']) * 100

    throughput_diff = trained_metrics['output_throughput'] - baseline_metrics['output_throughput']
    throughput_pct = (throughput_diff / baseline_metrics['output_throughput']) * 100

    print("\nDifference:")
    print(f"  Accept Length: {accept_diff:+.4f} ({accept_pct:+.2f}%)")
    print(f"  Throughput: {throughput_diff:+.2f} tokens/s ({throughput_pct:+.2f}%)")

    # Compare accuracy only when both runs report it
    if baseline_acc is not None and trained_acc is not None:
        acc_pct = (trained_acc - baseline_acc) * 100
        print(f"  Accuracy: {acc_pct:+.2f} percentage points")

print("\n" + "=" * 80)
EOF
```

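The report reads each results file as a single JSON document and takes its metrics from `result[dataset][0]['metrics'][0]`. The sketch below isolates that lookup and the difference arithmetic so they can be checked on their own. The numbers are made up for illustration and are not measured results, and `extract_metrics` and `relative_change` are hypothetical helper names.

```python
def extract_metrics(result, dataset):
    """Return the first metrics entry for `dataset`, following the layout the report expects."""
    return result[dataset][0]["metrics"][0]


def relative_change(baseline_value, trained_value):
    """Return (absolute difference, percent change relative to the baseline)."""
    diff = trained_value - baseline_value
    return diff, diff / baseline_value * 100


# Illustrative values only, NOT real benchmark output.
baseline = {"gsm8k": [{"metrics": [{"accept_length": 2.0, "output_throughput": 100.0}]}]}
trained = {"gsm8k": [{"metrics": [{"accept_length": 2.5, "output_throughput": 120.0}]}]}

b = extract_metrics(baseline, "gsm8k")
t = extract_metrics(trained, "gsm8k")
diff, pct = relative_change(b["accept_length"], t["accept_length"])
print(f"Accept Length: {diff:+.4f} ({pct:+.2f}%)")  # prints: Accept Length: +0.5000 (+25.00%)
```
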

---

## 4. Quickly Inspect a Single Result
```bash
cd /workspace/hanrui/SpecForge-ext

# Show accept_length for the baseline runs
cat results/baseline_mtbench_*.jsonl | jq '.mtbench[0].metrics[0].accept_length'
cat results/baseline_gsm8k_*.jsonl | jq '.gsm8k[0].metrics[0].accept_length'
cat results/baseline_humaneval_*.jsonl | jq '.humaneval[0].metrics[0].accept_length'

# Show accept_length for the trained runs
cat results/trained_mtbench_*.jsonl | jq '.mtbench[0].metrics[0].accept_length'
cat results/trained_gsm8k_*.jsonl | jq '.gsm8k[0].metrics[0].accept_length'
cat results/trained_humaneval_*.jsonl | jq '.humaneval[0].metrics[0].accept_length'
```

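If a benchmark has been run more than once, each `cat ... | jq` pipeline above prints one value per matching file. The following Python sketch reads only one file per benchmark instead, assuming (as the comparison report does) that the last name in sorted order is the run of interest. `latest_accept_length` is a hypothetical helper name.

```python
import glob
import json
import os


def latest_accept_length(results_dir, prefix, dataset):
    """Return accept_length from the last matching results file, or None if no file matches."""
    pattern = os.path.join(results_dir, f"{prefix}_{dataset}_*.jsonl")
    files = sorted(glob.glob(pattern))
    if not files:
        return None
    with open(files[-1], "r", encoding="utf-8") as f:
        result = json.load(f)
    return result[dataset][0]["metrics"][0]["accept_length"]


if __name__ == "__main__":
    # Run from the repository root so that "results" resolves correctly.
    for prefix in ("baseline", "trained"):
        for dataset in ("mtbench", "gsm8k", "humaneval"):
            print(prefix, dataset, latest_accept_length("results", prefix, dataset))
```
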