---
library_name: transformers
tags:
- qwen3.5
- awq
- speculative-decoding
- eagle3
- sglang
---
# EQAQ v2
EQAQ v2 packages the EQC Qwen3.5 4B text-only AWQ target model for use with
SGLang, together with the EAGLE3 draft models used in the local speculative
decoding experiments.
Repository layout:
```text
.
|-- config.json
|-- model-00001-of-00001.safetensors
|-- model.safetensors.index.json
|-- tokenizer.json
|-- tokenizer_config.json
|-- vocab.json
|-- merges.txt
|-- chat_template.jinja
`-- drafts/
    |-- q028-fast-sglangcompat/
    `-- q004-chatthink-sglangcompat/
```
The root model is the target model. The draft directories are EAGLE3 draft
models for SGLang speculative decoding and are not standalone target models.
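
As a quick sanity check after downloading, the layout above can be verified with a few lines of Python. This is a minimal sketch that only tests for the names listed in the tree; the function name `missing_entries` is illustrative and not part of this repository.

```python
from pathlib import Path

# File and directory names copied from the repository layout above.
TARGET_FILES = [
    "config.json",
    "model-00001-of-00001.safetensors",
    "model.safetensors.index.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "chat_template.jinja",
]
DRAFT_DIRS = [
    "drafts/q028-fast-sglangcompat",
    "drafts/q004-chatthink-sglangcompat",
]


def missing_entries(snapshot_dir):
    """Return the expected files and draft directories absent from snapshot_dir."""
    root = Path(snapshot_dir)
    missing = [name for name in TARGET_FILES if not (root / name).is_file()]
    missing += [name for name in DRAFT_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list means every target file and both draft directories are present. The check does not look inside the draft directories or validate the weights.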
## Expected Performance
These numbers are local measurements from the EQC competition protocol harness,
not an official leaderboard score. The official submission uploaded
successfully, but the evaluation job failed before scoring because the service
could not provision the requested ML compute capacity.
Recommended route setup for the measured run:
- Target model: repository root AWQ model
- Latency, MMLU-Pro, IFEval draft: `drafts/q028-fast-sglangcompat`
- GPQA/thinking draft: `drafts/q004-chatthink-sglangcompat`
- SGLang speculative decoding: EAGLE3, `speculative-num-steps=10`,
  `speculative-eagle-topk=2`, `speculative-num-draft-tokens=20`
### Local latency
Measured with the EQC latency request shape: `/v1/completions`, logical batch
size 1, 5 warmup runs, 50 measurement runs per category.
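
The measurement loop described above can be approximated as follows. This is a rough sketch, not the EQC harness itself: the request fields (`model`, `prompt`, `max_tokens`) are the standard OpenAI-compatible completion fields, while the helper names are hypothetical. How the harness builds prompts of exactly 64, 2048, or 8192 tokens is not shown here.

```python
import json
import statistics
import time
import urllib.request


def post_completion(base_url, model, prompt, max_tokens):
    """Send one non-streaming request to /v1/completions and return the parsed JSON."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def median_latency_ms(send, warmup=5, runs=50):
    """Call send() `warmup` times untimed, then return the median of `runs` timed calls."""
    for _ in range(warmup):
        send()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        send()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

A call such as `median_latency_ms(lambda: post_completion("http://localhost:30000", "default", prompt, 128))` would time the short category. The server address and model name in that call are placeholders, and requests are issued one at a time to match logical batch size 1.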
The speedup below is computed against a target-only run measured on the same
local machine, not against the fixed baseline constants embedded in the EQC
protocol harness.
| Category | Prompt / new tokens | Target-only median | EQAQ v2 median | Local speedup |
|---|---:|---:|---:|---:|
| short | 64 / 128 | 852.58 ms | 228.87 ms | 3.73x |
| medium | 2048 / 256 | 1771.02 ms | 475.62 ms | 3.72x |
| long | 8192 / 256 | 2179.81 ms | 847.43 ms | 2.57x |
The average local speedup was **3.10x**, computed as the ratio of the mean
category medians (`1601.14 ms / 517.31 ms`). The older **9.41x** figure comes
from dividing the EQC harness fixed baseline constants (`2582/5441/6576 ms`) by
the same EQAQ v2 medians and should not be interpreted as a speedup over a
baseline measured on this machine.
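
The arithmetic behind these figures can be reproduced from the table values alone. The sketch below uses only numbers stated in this README; the variable and function names are illustrative.

```python
# Median latencies in ms, copied from the latency table.
TARGET_ONLY_MS = {"short": 852.58, "medium": 1771.02, "long": 2179.81}
EQAQ_V2_MS = {"short": 228.87, "medium": 475.62, "long": 847.43}
# Fixed baseline constants from the EQC harness, in ms.
HARNESS_BASELINE_MS = {"short": 2582.0, "medium": 5441.0, "long": 6576.0}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def ratio_of_means(baseline_ms, measured_ms):
    """Mean of the baseline medians divided by the mean of the measured medians."""
    return mean(baseline_ms.values()) / mean(measured_ms.values())


# Per-category speedup: target-only median over EQAQ v2 median.
per_category = {k: TARGET_ONLY_MS[k] / EQAQ_V2_MS[k] for k in TARGET_ONLY_MS}
local_speedup = ratio_of_means(TARGET_ONLY_MS, EQAQ_V2_MS)
harness_ratio = ratio_of_means(HARNESS_BASELINE_MS, EQAQ_V2_MS)

print({k: round(v, 2) for k, v in per_category.items()})
print(round(local_speedup, 2), round(harness_ratio, 2))
```

The headline number is a ratio of means, so the long category, which has the largest absolute latency, weighs most heavily. Averaging the three per-category speedups instead would give about 3.34x, which is a different statistic from the one reported here.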
A submission-aligned smoke run with a more conservative single-image setup
measured about **4.39x** against the same fixed protocol constants over 3 runs
per category; it is included only as a packaging/protocol smoke result, not as
the local target-only speedup.
Baseline caveat: the target-only no-spec SGLang server crashed with the default
piecewise CUDA graph path (`NoneType mrope_positions`), so the local
target-only baseline was measured with `--disable-piecewise-cuda-graph` while
keeping the same target model, endpoint, prompt/token protocol, CUDA graph
batch sizes, and core SGLang serving options.
The speculative accept rate observed in the active local SGLang run was low,
roughly **6%** over recent decode batches, so the latency gain should be
understood as the combined effect of SGLang serving settings, CUDA graphs, and
speculative decoding rather than of high draft acceptance alone.
### Local quality
Measured in the same local full protocol run:
| Benchmark | Metric | Score | Gate |
|---|---|---:|---:|
| MMLU-Pro | exact_match, custom-extract | 0.6525 | 0.621 |
| IFEval | inst_level_strict_acc | 0.8106 | 0.814 |
| GPQA-Diamond | exact_match, flexible-extract | 0.4293 | 0.630 |
The local run passed the latency gate and MMLU-Pro, but did **not** pass the
full quality gate because IFEval was slightly below threshold and GPQA-Diamond
was substantially below threshold. Treat this package as a speed-oriented EQC
artifact, not a confirmed quality-passing competition submission.
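
The gate outcome can be restated as per-benchmark margins. This sketch assumes a benchmark passes when its score is greater than or equal to its gate; the exact comparison used by the EQC harness is not stated in this README.

```python
# (score, gate) pairs copied from the quality table.
RESULTS = {
    "MMLU-Pro": (0.6525, 0.621),
    "IFEval": (0.8106, 0.814),
    "GPQA-Diamond": (0.4293, 0.630),
}


def gate_margins(results):
    """Return score minus gate per benchmark, rounded to 4 decimal places."""
    return {name: round(score - gate, 4) for name, (score, gate) in results.items()}


margins = gate_margins(RESULTS)
# Assumption: a negative margin means the gate was missed.
failed = [name for name, margin in margins.items() if margin < 0]

for name, margin in margins.items():
    print(f"{name}: {margin:+.4f}")
print("failed:", failed)
```

Under that assumption, IFEval misses its gate by 0.0034 and GPQA-Diamond by 0.2007, while MMLU-Pro clears its gate by 0.0315.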
Expected SGLang usage shape:
```bash
python -m sglang.launch_server \
  --model-path <local-snapshot-of-this-repo> \
  --tokenizer-path <local-snapshot-of-this-repo> \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path <local-snapshot-of-this-repo>/drafts/q028-fast-sglangcompat \
  --speculative-draft-model-quantization unquant \
  --speculative-num-steps 10 \
  --speculative-eagle-topk 2 \
  --speculative-num-draft-tokens 20
```
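
Because the recommended setup uses a different draft for GPQA, it can help to generate the launch command per route. The sketch below is hypothetical: the route labels and the function `build_launch_command` are illustrative, and reusing the same flags with the q004 draft is an assumption, since only the q028 command is shown above. The flags themselves are copied from that command.

```python
import shlex

# Route-to-draft mapping taken from the recommended route setup.
# The route labels are illustrative names, not identifiers used by EQC or SGLang.
DRAFT_FOR_ROUTE = {
    "latency": "drafts/q028-fast-sglangcompat",
    "mmlu-pro": "drafts/q028-fast-sglangcompat",
    "ifeval": "drafts/q028-fast-sglangcompat",
    "gpqa": "drafts/q004-chatthink-sglangcompat",
}


def build_launch_command(snapshot_dir, route):
    """Return the argv list for launching SGLang with the draft chosen for `route`."""
    draft_path = f"{snapshot_dir}/{DRAFT_FOR_ROUTE[route]}"
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", snapshot_dir,
        "--tokenizer-path", snapshot_dir,
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", draft_path,
        "--speculative-draft-model-quantization", "unquant",
        "--speculative-num-steps", "10",
        "--speculative-eagle-topk", "2",
        "--speculative-num-draft-tokens", "20",
    ]


# Print a shell-quoted command line for the GPQA route.
print(shlex.join(build_launch_command("/path/to/snapshot", "gpqa")))
```

The snapshot path in the final line is a placeholder. This README does not describe how the measured run switched between drafts, so the mapping should be read as a convenience for launching one server per route rather than a description of the original setup.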
Local source artifacts:
- Target: `/home/project-a/efficient-qwen/models/qwen35-4b-awq-text-only-sglang-compat`
- q028 draft: `/home/ubuntu/EQC/artifacts/eagle3/q028_q018_step120_long_steps10_lr5e7_20260522T073503Z/models/Qwen3.5-4B-TextOnly-EAGLE3-Q028-Q018Step120-LongSteps10-LR5e7-SGLangCompat`
- q004 draft: `/home/ubuntu/EQC/artifacts/eagle3/q004_modesplit_20260521-q004-chatthink-reuse-a/models/Qwen3.5-4B-TextOnly-EAGLE3-Q004-ChatThink-SGLangCompat`