---
library_name: transformers
tags:
- qwen3.5
- awq
- speculative-decoding
- eagle3
- sglang
---

# EQAQ v2

EQAQ v2 packages the EQC Qwen3.5 4B text-only AWQ target model for use with
SGLang, together with the EAGLE3 draft models used in the local speculative
decoding experiments.

Repository layout:

```text
.
|-- config.json
|-- model-00001-of-00001.safetensors
|-- model.safetensors.index.json
|-- tokenizer.json
|-- tokenizer_config.json
|-- vocab.json
|-- merges.txt
|-- chat_template.jinja
`-- drafts/
    |-- q028-fast-sglangcompat/
    `-- q004-chatthink-sglangcompat/
```

The root model is the target model. The draft directories are EAGLE3 draft
models for SGLang speculative decoding and are not standalone target models.

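Before pointing a server at a downloaded copy, it can help to confirm that the snapshot matches the layout above. The following is a minimal sketch: the file and directory names are copied from the layout listing, while the helper name `missing_entries` and the snapshot path are hypothetical.

```python
from pathlib import Path

# File names copied from the repository layout listed in this card.
EXPECTED_FILES = [
    "config.json",
    "model-00001-of-00001.safetensors",
    "model.safetensors.index.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "chat_template.jinja",
]

# EAGLE3 draft directories, relative to the snapshot root.
EXPECTED_DRAFT_DIRS = [
    "drafts/q028-fast-sglangcompat",
    "drafts/q004-chatthink-sglangcompat",
]


def missing_entries(snapshot_dir):
    """Return the expected files and draft directories absent from snapshot_dir."""
    root = Path(snapshot_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DRAFT_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list means every listed entry is present. The check covers only the names shown in the layout, not the contents of the draft directories.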
## Expected Performance

These numbers are local measurements from the EQC competition protocol harness,
not an official leaderboard score. The official submission uploaded
successfully, but the evaluation job failed before scoring because the service
could not provision the requested ML compute capacity.

Recommended route setup for the measured run:

- Target model: repository root AWQ model
- Latency, MMLU-Pro, IFEval draft: `drafts/q028-fast-sglangcompat`
- GPQA/thinking draft: `drafts/q004-chatthink-sglangcompat`
- SGLang speculative decoding: EAGLE3, `speculative-num-steps=10`,
  `speculative-eagle-topk=2`, `speculative-num-draft-tokens=20`

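The route setup above can be written down as a small lookup that picks the draft directory for a benchmark and assembles the matching server command. This is a sketch under assumptions: the route keys (`latency`, `mmlu_pro`, `ifeval`, `gpqa`) and the helper name `launch_args` are hypothetical, and the flags are the ones that appear in this card's SGLang usage shape.

```python
from pathlib import Path

# Draft directory per evaluation route, following the route setup in this card.
# The route keys are hypothetical labels chosen for this sketch.
DRAFT_BY_ROUTE = {
    "latency": "drafts/q028-fast-sglangcompat",
    "mmlu_pro": "drafts/q028-fast-sglangcompat",
    "ifeval": "drafts/q028-fast-sglangcompat",
    "gpqa": "drafts/q004-chatthink-sglangcompat",
}


def launch_args(snapshot_dir, route):
    """Build the sglang.launch_server argument list for one route."""
    root = Path(snapshot_dir)
    draft = root / DRAFT_BY_ROUTE[route]
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", str(root),
        "--tokenizer-path", str(root),
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", str(draft),
        "--speculative-draft-model-quantization", "unquant",
        "--speculative-num-steps", "10",
        "--speculative-eagle-topk", "2",
        "--speculative-num-draft-tokens", "20",
    ]
```

The returned list can be passed to `subprocess.run` directly. Keeping the route table separate from the flag list makes it easier to see that only the draft path changes between routes.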
### Local latency

Measured with the EQC latency request shape: `/v1/completions`, logical batch
size 1, 5 warmup runs, 50 measurement runs per category.

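The measurement shape described above (sequential requests, 5 warmup runs, then the median over 50 timed runs) can be sketched as follows. This is not the EQC harness itself: the function names are hypothetical, the request body uses only the standard `model`, `prompt`, and `max_tokens` fields of an OpenAI-compatible completions endpoint, and the way the harness builds prompts of a given token length is not reproduced here.

```python
import json
import statistics
import time
import urllib.request


def post_completion(base_url, model, prompt, max_tokens):
    """Send one /v1/completions request and return the parsed JSON response."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def median_latency_ms(send, warmup_runs=5, measured_runs=50):
    """Call send() one request at a time; return the median timed latency in ms."""
    for _ in range(warmup_runs):
        send()  # warmup calls are not timed
    samples = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        send()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

A call such as `median_latency_ms(lambda: post_completion(base_url, model_name, prompt, 128))` would time one category, where `base_url`, `model_name`, and `prompt` are placeholders for the local server address, the served model name, and a prompt of the category's length. Issuing requests strictly one after another is what keeps the logical batch size at 1.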
The speedup below is computed against a target-only run measured on the same
local machine, not against the fixed baseline constants embedded in the EQC
protocol harness.

| Category | Prompt / new tokens | Target-only median | EQAQ v2 median | Local speedup |
|---|---:|---:|---:|---:|
| short | 64 / 128 | 852.58 ms | 228.87 ms | 3.73x |
| medium | 2048 / 256 | 1771.02 ms | 475.62 ms | 3.72x |
| long | 8192 / 256 | 2179.81 ms | 847.43 ms | 2.57x |

The average local speedup was **3.10x**, taken as the ratio of the averaged
category medians (`1601.14 ms / 517.31 ms`). The older **9.41x** figure comes
from dividing by the EQC harness fixed baseline constants (`2582/5441/6576 ms`)
and should not be interpreted as a speedup over a baseline measured on this
machine.

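The two headline figures can be reproduced from the table with a few lines of arithmetic. The sketch below assumes the fixed baseline constants `2582/5441/6576 ms` correspond to the short, medium, and long categories in that order, which the text does not state explicitly; the variable and function names are illustrative.

```python
# Medians in milliseconds, copied from the latency table.
target_only_ms = {"short": 852.58, "medium": 1771.02, "long": 2179.81}
eqaq_v2_ms = {"short": 228.87, "medium": 475.62, "long": 847.43}

# Assumption: the harness constants map to short/medium/long in this order.
fixed_baseline_ms = {"short": 2582.0, "medium": 5441.0, "long": 6576.0}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def ratio_of_means(baseline, measured):
    """Average the category medians first, then divide baseline by measured."""
    return mean(baseline.values()) / mean(measured.values())


per_category = {k: target_only_ms[k] / eqaq_v2_ms[k] for k in eqaq_v2_ms}
local_speedup = ratio_of_means(target_only_ms, eqaq_v2_ms)
fixed_speedup = ratio_of_means(fixed_baseline_ms, eqaq_v2_ms)

print({k: round(v, 2) for k, v in per_category.items()})
print(round(local_speedup, 2), round(fixed_speedup, 2))
```

Because the medians are averaged before dividing, the slower long category carries more weight; averaging the three per-category ratios instead would give roughly 3.34x rather than 3.10x.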
A submission-aligned smoke run with a more conservative single-image setup
measured about **4.39x** against the same fixed protocol constants over 3 runs
per category; it is included only as a packaging/protocol smoke result, not as
the local target-only speedup.

Baseline caveat: the target-only no-spec SGLang server crashed with the default
piecewise CUDA graph path (`NoneType mrope_positions`), so the local
target-only baseline was measured with `--disable-piecewise-cuda-graph` while
keeping the same target model, endpoint, prompt/token protocol, CUDA graph
batch sizes, and core SGLang serving options.

Observed speculative accept rate in the active local SGLang run was low,
roughly **6%** over recent decode batches, so the latency gain should be
understood as the combined effect of SGLang serving settings, CUDA graph, and
speculative decoding rather than high draft acceptance alone.

### Local quality

Measured in the same local full protocol run:

| Benchmark | Metric | Score | Gate |
|---|---|---:|---:|
| MMLU-Pro | exact_match, custom-extract | 0.6525 | 0.621 |
| IFEval | inst_level_strict_acc | 0.8106 | 0.814 |
| GPQA-Diamond | exact_match, flexible-extract | 0.4293 | 0.630 |

The local run passed the latency gate and MMLU-Pro, but did **not** pass the
full quality gate because IFEval was slightly below threshold and GPQA-Diamond
was substantially below threshold. Treat this package as a speed-oriented EQC
artifact, not a confirmed quality-passing competition submission.

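The pass/fail reading of the quality table can be made explicit with a short check. This sketch assumes a benchmark passes when its score is greater than or equal to its gate, which matches the outcome described above but is not taken from the harness; the names `RESULTS` and `gate_report` are illustrative.

```python
# (score, gate) pairs copied from the local quality table.
RESULTS = {
    "MMLU-Pro": (0.6525, 0.621),
    "IFEval": (0.8106, 0.814),
    "GPQA-Diamond": (0.4293, 0.630),
}


def gate_report(results):
    """Return pass/fail and the score-minus-gate margin for each benchmark."""
    report = {}
    for name, (score, gate) in results.items():
        # Assumption: a score at or above the gate counts as a pass.
        report[name] = {"passed": score >= gate, "margin": round(score - gate, 4)}
    return report


report = gate_report(RESULTS)
for name, entry in report.items():
    print(name, "pass" if entry["passed"] else "fail", entry["margin"])
print("full quality gate:", all(entry["passed"] for entry in report.values()))
```

The margins make the difference between the two failures visible: IFEval misses its gate by about 0.003, while GPQA-Diamond misses by about 0.2.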
Expected SGLang usage shape:

```bash
python -m sglang.launch_server \
  --model-path <local-snapshot-of-this-repo> \
  --tokenizer-path <local-snapshot-of-this-repo> \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path <local-snapshot-of-this-repo>/drafts/q028-fast-sglangcompat \
  --speculative-draft-model-quantization unquant \
  --speculative-num-steps 10 \
  --speculative-eagle-topk 2 \
  --speculative-num-draft-tokens 20
```

Local source artifacts:

- Target: `/home/project-a/efficient-qwen/models/qwen35-4b-awq-text-only-sglang-compat`
- q028 draft: `/home/ubuntu/EQC/artifacts/eagle3/q028_q018_step120_long_steps10_lr5e7_20260522T073503Z/models/Qwen3.5-4B-TextOnly-EAGLE3-Q028-Q018Step120-LongSteps10-LR5e7-SGLangCompat`
- q004 draft: `/home/ubuntu/EQC/artifacts/eagle3/q004_modesplit_20260521-q004-chatthink-reuse-a/models/Qwen3.5-4B-TextOnly-EAGLE3-Q004-ChatThink-SGLangCompat`