---
library_name: transformers
tags:
- qwen3.5
- awq
- speculative-decoding
- eagle3
- sglang
---

# EQAQ v2

EQAQ v2 is the EQC Qwen3.5 4B text-only AWQ target model package used with SGLang, plus the EAGLE3 draft models used in the local speculative decoding experiments.

Repository layout:

```text
.
|-- config.json
|-- model-00001-of-00001.safetensors
|-- model.safetensors.index.json
|-- tokenizer.json
|-- tokenizer_config.json
|-- vocab.json
|-- merges.txt
|-- chat_template.jinja
`-- drafts/
    |-- q028-fast-sglangcompat/
    `-- q004-chatthink-sglangcompat/
```

The root model is the target model. The draft directories are EAGLE3 draft models for SGLang speculative decoding and are not standalone target models.

## Expected Performance

These numbers are local measurements from the EQC competition protocol harness, not an official leaderboard score. The official submission uploaded successfully, but the evaluation job failed before scoring because the service could not provision the requested ML compute capacity.

Recommended route setup for the measured run:

- Target model: repository root AWQ model
- Latency, MMLU-Pro, IFEval draft: `drafts/q028-fast-sglangcompat`
- GPQA/thinking draft: `drafts/q004-chatthink-sglangcompat`
- SGLang speculative decoding: EAGLE3, `speculative-num-steps=10`, `speculative-eagle-topk=2`, `speculative-num-draft-tokens=20`

### Local latency

Measured with the EQC latency request shape: `/v1/completions`, logical batch size 1, 5 warmup runs, 50 measurement runs per category. The speedup below is computed against a target-only run measured on the same local machine, not against the fixed baseline constants embedded in the EQC protocol harness.

| Category | Prompt / new tokens | Target-only median | EQAQ v2 median | Local speedup |
|---|---:|---:|---:|---:|
| short | 64 / 128 | 852.58 ms | 228.87 ms | 3.73x |
| medium | 2048 / 256 | 1771.02 ms | 475.62 ms | 3.72x |
| long | 8192 / 256 | 2179.81 ms | 847.43 ms | 2.57x |

Average local speedup was **3.10x** using the average of category medians (`1601.14 ms / 517.31 ms`).

The older **9.41x** figure comes from dividing by the EQC harness fixed baseline constants (`2582/5441/6576 ms`) and should not be interpreted as a speedup over a baseline measured on this machine. A submission-aligned smoke run with a more conservative single-image setup measured about **4.39x** against the same fixed protocol constants over 3 runs per category; it is included only as a packaging/protocol smoke result, not as the local target-only speedup.

Baseline caveat: the target-only no-spec SGLang server crashed with the default piecewise CUDA graph path (`NoneType mrope_positions`), so the local target-only baseline was measured with `--disable-piecewise-cuda-graph` while keeping the same target model, endpoint, prompt/token protocol, CUDA graph batch sizes, and core SGLang serving options.

Observed speculative accept rate in the active local SGLang run was low, roughly **6%** over recent decode batches, so the latency gain should be understood as the combined effect of SGLang serving settings, CUDA graph, and speculative decoding rather than high draft acceptance alone.

### Local quality

Measured in the same local full protocol run:

| Benchmark | Metric | Score | Gate |
|---|---|---:|---:|
| MMLU-Pro | exact_match, custom-extract | 0.6525 | 0.621 |
| IFEval | inst_level_strict_acc | 0.8106 | 0.814 |
| GPQA-Diamond | exact_match, flexible-extract | 0.4293 | 0.630 |

The local run passed the latency gate and MMLU-Pro, but did **not** pass the full quality gate because IFEval was slightly below threshold and GPQA-Diamond was substantially below threshold.

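
The speedup and gate figures reported in this section can be re-derived from the table values with a short script. This is a minimal sketch, not part of the EQC harness: the variable and function names are illustrative, and it assumes a gate counts as passed when the score is greater than or equal to the threshold.

```python
from statistics import mean

# Category medians in milliseconds, copied from the latency table.
TARGET_ONLY_MS = {"short": 852.58, "medium": 1771.02, "long": 2179.81}
EQAQ_V2_MS = {"short": 228.87, "medium": 475.62, "long": 847.43}

# Fixed baseline constants quoted from the EQC protocol harness.
FIXED_BASELINE_MS = {"short": 2582.0, "medium": 5441.0, "long": 6576.0}

# (score, gate) pairs copied from the quality table.
QUALITY = {
    "MMLU-Pro": (0.6525, 0.621),
    "IFEval": (0.8106, 0.814),
    "GPQA-Diamond": (0.4293, 0.630),
}


def speedup_of_means(baseline_ms, candidate_ms):
    """Ratio of the average baseline median to the average candidate median."""
    return mean(baseline_ms.values()) / mean(candidate_ms.values())


def failed_gates(results):
    """Benchmarks whose score is below the gate (assumed rule: pass if score >= gate)."""
    return [name for name, (score, gate) in results.items() if score < gate]


per_category = {k: TARGET_ONLY_MS[k] / EQAQ_V2_MS[k] for k in TARGET_ONLY_MS}
local_speedup = speedup_of_means(TARGET_ONLY_MS, EQAQ_V2_MS)
fixed_speedup = speedup_of_means(FIXED_BASELINE_MS, EQAQ_V2_MS)

for name, value in per_category.items():
    print(f"{name}: {value:.2f}x")
print(f"local average: {local_speedup:.2f}x")
print(f"fixed-constant average: {fixed_speedup:.2f}x")
print("failed gates:", failed_gates(QUALITY))
```

Keeping the two baselines as separate inputs makes it explicit that the 3.10x and 9.41x figures share the same numerator-side measurement (the EQAQ v2 medians) and differ only in which baseline they are divided into.
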
Treat this package as a speed-oriented EQC artifact, not a confirmed quality-passing competition submission.

Expected SGLang usage shape:

```bash
python -m sglang.launch_server \
  --model-path \
  --tokenizer-path \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path /drafts/q028-fast-sglangcompat \
  --speculative-draft-model-quantization unquant \
  --speculative-num-steps 10 \
  --speculative-eagle-topk 2 \
  --speculative-num-draft-tokens 20
```

Local source artifacts:

- Target: `/home/project-a/efficient-qwen/models/qwen35-4b-awq-text-only-sglang-compat`
- q028 draft: `/home/ubuntu/EQC/artifacts/eagle3/q028_q018_step120_long_steps10_lr5e7_20260522T073503Z/models/Qwen3.5-4B-TextOnly-EAGLE3-Q028-Q018Step120-LongSteps10-LR5e7-SGLangCompat`
- q004 draft: `/home/ubuntu/EQC/artifacts/eagle3/q004_modesplit_20260521-q004-chatthink-reuse-a/models/Qwen3.5-4B-TextOnly-EAGLE3-Q004-ChatThink-SGLangCompat`
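
The recommended route setup pairs each benchmark with one draft directory, so the launch command can be assembled per route. The following is a minimal sketch under stated assumptions: `build_launch_command`, `DRAFT_BY_ROUTE`, the route keys, and the example directory `/models/eqaq-v2` are hypothetical names introduced here, and it assumes both `--model-path` and `--tokenizer-path` point at the repository root, which the usage shape above does not state. The flags themselves are the ones listed in this README.

```python
import shlex
from pathlib import PurePosixPath

# Route -> draft directory, following the recommended route setup.
# The route keys are hypothetical labels chosen for this sketch.
DRAFT_BY_ROUTE = {
    "latency": "drafts/q028-fast-sglangcompat",
    "mmlu_pro": "drafts/q028-fast-sglangcompat",
    "ifeval": "drafts/q028-fast-sglangcompat",
    "gpqa": "drafts/q004-chatthink-sglangcompat",
}


def build_launch_command(repo_dir, route):
    """Return the SGLang launch command as an argument list for one route.

    Assumption: the target model and tokenizer both live at the repository
    root given by `repo_dir` (a hypothetical local checkout path).
    """
    repo = PurePosixPath(repo_dir)
    draft = repo / DRAFT_BY_ROUTE[route]
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", str(repo),
        "--tokenizer-path", str(repo),
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", str(draft),
        "--speculative-draft-model-quantization", "unquant",
        "--speculative-num-steps", "10",
        "--speculative-eagle-topk", "2",
        "--speculative-num-draft-tokens", "20",
    ]


# Example: print a shell-quoted command for the GPQA/thinking route.
print(shlex.join(build_launch_command("/models/eqaq-v2", "gpqa")))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues; `shlex.join` is used only for display.
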