Instructions for using NotaMG/eqaq-v2 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use NotaMG/eqaq-v2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="NotaMG/eqaq-v2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NotaMG/eqaq-v2")
model = AutoModelForCausalLM.from_pretrained("NotaMG/eqaq-v2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use NotaMG/eqaq-v2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "NotaMG/eqaq-v2"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NotaMG/eqaq-v2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/NotaMG/eqaq-v2

SGLang

How to use NotaMG/eqaq-v2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "NotaMG/eqaq-v2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NotaMG/eqaq-v2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
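
The same request can also be sent from Python. The following is a minimal sketch using only the standard library. It assumes the SGLang server started above is listening on `localhost:30000`; the helper names `build_chat_request` and `ask` are illustrative and not part of SGLang.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumption: server launched as shown above


def build_chat_request(prompt, model="NotaMG/eqaq-v2", base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("What is the capital of France?")` requires a running server; `build_chat_request` can be inspected without one.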

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "NotaMG/eqaq-v2" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use NotaMG/eqaq-v2 with Docker Model Runner:
```
docker model run hf.co/NotaMG/eqaq-v2
```

eqaq-v2 / README.md

NotaMG

Correct speedup to local target-only baseline

fd70217 verified about 1 month ago

	---
	library_name: transformers
	tags:
	- qwen3.5
	- awq
	- speculative-decoding
	- eagle3
	- sglang
	---

	# EQAQ v2

	EQAQ v2 is the EQC Qwen3.5 4B text-only AWQ target model package used with
	SGLang, plus the EAGLE3 draft models used in the local speculative decoding
	experiments.

	Repository layout:

	```text
	.
	|-- config.json
	|-- model-00001-of-00001.safetensors
	|-- model.safetensors.index.json
	|-- tokenizer.json
	|-- tokenizer_config.json
	|-- vocab.json
	|-- merges.txt
	|-- chat_template.jinja
	`-- drafts/
	    |-- q028-fast-sglangcompat/
	    `-- q004-chatthink-sglangcompat/
	```

	The root model is the target model. The draft directories are EAGLE3 draft
	models for SGLang speculative decoding and are not standalone target models.

	## Expected Performance

	These numbers are local measurements from the EQC competition protocol harness,
	not an official leaderboard score. The official submission uploaded
	successfully, but the evaluation job failed before scoring because the service
	could not provision the requested ML compute capacity.

	Recommended route setup for the measured run:

	- Target model: repository root AWQ model
	- Latency, MMLU-Pro, IFEval draft: `drafts/q028-fast-sglangcompat`
	- GPQA/thinking draft: `drafts/q004-chatthink-sglangcompat`
	- SGLang speculative decoding: EAGLE3, `speculative-num-steps=10`,
	  `speculative-eagle-topk=2`, `speculative-num-draft-tokens=20`

	### Local latency

	Measured with the EQC latency request shape: `/v1/completions`, logical batch
	size 1, 5 warmup runs, 50 measurement runs per category.
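
The measurement shape described here (discarded warmup runs, then the median over timed runs) can be sketched as below. This is not the EQC harness itself; `send_request` is a hypothetical callable that issues one `/v1/completions` request and blocks until the response arrives.

```python
import statistics
import time


def measure_median_ms(send_request, warmup=5, runs=50):
    """Return the median wall-clock latency of send_request() in milliseconds."""
    for _ in range(warmup):
        send_request()  # warmup runs are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        send_request()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

One call per category (short, medium, long) with the matching prompt and new-token lengths would yield the three medians reported in the table.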

	The speedup below is computed against a target-only run measured on the same
	local machine, not against the fixed baseline constants embedded in the EQC
	protocol harness.

	| Category | Prompt / new tokens | Target-only median | EQAQ v2 median | Local speedup |
	|---|---:|---:|---:|---:|
	| short | 64 / 128 | 852.58 ms | 228.87 ms | 3.73x |
	| medium | 2048 / 256 | 1771.02 ms | 475.62 ms | 3.72x |
	| long | 8192 / 256 | 2179.81 ms | 847.43 ms | 2.57x |

	Average local speedup was 3.10x using the average of category medians
	(`1601.14 ms / 517.31 ms`). The older 9.41x figure instead divides the mean of
	the EQC harness fixed baseline constants (`2582/5441/6576 ms`) by the same
	517.31 ms and should not be interpreted as a speedup over a baseline measured
	on this machine.
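
The arithmetic behind these figures can be reproduced from the reported medians. The snippet below is only a worked check of the numbers above; the variable names are illustrative.

```python
# Medians in milliseconds, copied from the latency table.
target_only_ms = {"short": 852.58, "medium": 1771.02, "long": 2179.81}
eqaq_ms = {"short": 228.87, "medium": 475.62, "long": 847.43}
# Fixed baseline constants from the EQC protocol harness, in milliseconds.
fixed_baseline_ms = [2582, 5441, 6576]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Per-category speedup: target-only median divided by EQAQ v2 median.
per_category = {name: target_only_ms[name] / eqaq_ms[name] for name in eqaq_ms}
# Average speedup: ratio of the averaged medians, not the mean of the ratios.
local_speedup = mean(target_only_ms.values()) / mean(eqaq_ms.values())
fixed_speedup = mean(fixed_baseline_ms) / mean(eqaq_ms.values())

for name, ratio in per_category.items():
    print(f"{name}: {ratio:.2f}x")      # short: 3.73x, medium: 3.72x, long: 2.57x
print(f"local: {local_speedup:.2f}x")   # local: 3.10x
print(f"fixed: {fixed_speedup:.2f}x")   # fixed: 9.41x
```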

	A submission-aligned smoke run with a more conservative single-image setup
	measured about 4.39x against the same fixed protocol constants over 3 runs
	per category; it is included only as a packaging/protocol smoke result, not as
	the local target-only speedup.

	Baseline caveat: the target-only no-spec SGLang server crashed with the default
	piecewise CUDA graph path (`NoneType mrope_positions`), so the local
	target-only baseline was measured with `--disable-piecewise-cuda-graph` while
	keeping the same target model, endpoint, prompt/token protocol, CUDA graph
	batch sizes, and core SGLang serving options.

	Observed speculative accept rate in the active local SGLang run was low,
	roughly 6% over recent decode batches, so the latency gain should be
	understood as the combined effect of SGLang serving settings, CUDA graph, and
	speculative decoding rather than high draft acceptance alone.

	### Local quality

	Measured in the same local full protocol run:

	| Benchmark | Metric | Score | Gate |
	|---|---|---:|---:|
	| MMLU-Pro | exact_match, custom-extract | 0.6525 | 0.621 |
	| IFEval | inst_level_strict_acc | 0.8106 | 0.814 |
	| GPQA-Diamond | exact_match, flexible-extract | 0.4293 | 0.630 |

	The local run passed the latency gate and MMLU-Pro, but did not pass the
	full quality gate because IFEval was slightly below threshold and GPQA-Diamond
	was substantially below threshold. Treat this package as a speed-oriented EQC
	artifact, not a confirmed quality-passing competition submission.
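
The gate outcome can be checked mechanically from the table. The sketch below assumes a benchmark passes when its score is greater than or equal to its gate; the source does not state the exact comparison, and `gate_report` is an illustrative helper.

```python
# (score, gate) pairs copied from the quality table.
results = {
    "MMLU-Pro": (0.6525, 0.621),
    "IFEval": (0.8106, 0.814),
    "GPQA-Diamond": (0.4293, 0.630),
}


def gate_report(results):
    """Map each benchmark to (passed, margin); assumes pass means score >= gate."""
    return {
        name: (score >= gate, round(score - gate, 4))
        for name, (score, gate) in results.items()
    }


report = gate_report(results)
all_passed = all(passed for passed, _ in report.values())

for name, (passed, margin) in report.items():
    print(f"{name}: {'pass' if passed else 'fail'} (margin {margin:+.4f})")
print("full quality gate:", "pass" if all_passed else "fail")
```

The margins make the wording above concrete: IFEval misses its gate by 0.0034, while GPQA-Diamond misses by 0.2007.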

	Expected SGLang usage shape:

	```bash
	python -m sglang.launch_server \
	    --model-path <local-snapshot-of-this-repo> \
	    --tokenizer-path <local-snapshot-of-this-repo> \
	    --speculative-algorithm EAGLE3 \
	    --speculative-draft-model-path <local-snapshot-of-this-repo>/drafts/q028-fast-sglangcompat \
	    --speculative-draft-model-quantization unquant \
	    --speculative-num-steps 10 \
	    --speculative-eagle-topk 2 \
	    --speculative-num-draft-tokens 20
	```
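
Because the recommended draft differs by workload, it can help to build the launch command programmatically. The sketch below only assembles the argument list from the flags shown above. The route names (`latency`, `mmlu_pro`, `ifeval`, `gpqa`) and the helper `build_launch_command` are hypothetical, and the snapshot directory is whatever local path holds a copy of this repository.

```python
from pathlib import Path

# Draft directory per route, following the recommended route setup.
# The route keys are illustrative names, not identifiers from the EQC harness.
DRAFTS = {
    "latency": "drafts/q028-fast-sglangcompat",
    "mmlu_pro": "drafts/q028-fast-sglangcompat",
    "ifeval": "drafts/q028-fast-sglangcompat",
    "gpqa": "drafts/q004-chatthink-sglangcompat",
}


def build_launch_command(snapshot_dir, route):
    """Return the SGLang launch command (as an argument list) for one route."""
    root = Path(snapshot_dir)
    draft = root / DRAFTS[route]
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", str(root),
        "--tokenizer-path", str(root),
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", str(draft),
        "--speculative-draft-model-quantization", "unquant",
        "--speculative-num-steps", "10",
        "--speculative-eagle-topk", "2",
        "--speculative-num-draft-tokens", "20",
    ]
```

The returned list can be passed to `subprocess.run` or joined with `shlex.join` for inspection before launching.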

	Local source artifacts:

	- Target: `/home/project-a/efficient-qwen/models/qwen35-4b-awq-text-only-sglang-compat`
	- q028 draft: `/home/ubuntu/EQC/artifacts/eagle3/q028_q018_step120_long_steps10_lr5e7_20260522T073503Z/models/Qwen3.5-4B-TextOnly-EAGLE3-Q028-Q018Step120-LongSteps10-LR5e7-SGLangCompat`
	- q004 draft: `/home/ubuntu/EQC/artifacts/eagle3/q004_modesplit_20260521-q004-chatthink-reuse-a/models/Qwen3.5-4B-TextOnly-EAGLE3-Q004-ChatThink-SGLangCompat`