# Chunity/Qwen3.6-35B-A3B-AutoRound-AWQ-4bit
AutoRound 4-bit AWQ quantization of Qwen/Qwen3.6-35B-A3B.
## Quantization Summary

- Base model: Qwen/Qwen3.6-35B-A3B
- Quantization: AutoRound -> AWQ
- Scheme: W4A16
- Bits: 4
- Group size: 128
- Iterations: 500
- Output format: `auto_awq`
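For reference, a hypothetical reproduction sketch using the settings above. This is not the exact script behind this repo: the loader class, calibration defaults, and output path are assumptions, and this model's multimodal stack may require a different loader plus the per-module exclusions described in the next section.

```python
# Hypothetical reproduction sketch -- not the exact script behind this repo.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

base = "Qwen/Qwen3.6-35B-A3B"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)

# Mirror the summary above: 4-bit weights, group size 128, 500 tuning iterations.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, iters=500)
autoround.quantize()

# Export in AutoAWQ layout so vLLM can load it through its AWQ paths.
autoround.save_quantized("Qwen3.6-35B-A3B-AutoRound-AWQ-4bit", format="auto_awq")
```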
## What Was Quantized

This checkpoint keeps the multimodal stack intact and focuses AWQ quantization on the language-model blocks.

Quantized:

- `model.language_model.layers`

Left unquantized where required for functional runtime compatibility:

- `lm_head`
- `linear_attn.*`
- `self_attn.*` on the full-attention layers
- `mlp.shared_expert.*`
- `mlp.shared_expert_gate`
- visual tower and merger modules

MTP tensors were preserved in `model_extra_tensors.safetensors`.
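On a downloaded copy, the exclusion list can be confirmed straight from the shipped config. A minimal sketch, assuming the export follows the standard AWQ `modules_to_not_convert` convention in `quantization_config.json`:

```python
import json

# Read the quantization config shipped alongside the weights (local path assumed).
with open("quantization_config.json") as f:
    qcfg = json.load(f)

# Standard AWQ configs record excluded submodules under modules_to_not_convert;
# the exact key layout of this export is an assumption.
print("bits:", qcfg.get("bits"))
print("group_size:", qcfg.get("group_size"))
print("skipped:", qcfg.get("modules_to_not_convert"))
```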
## Runtime Notes

This checkpoint was validated on a recent vLLM build that loads it through the `awq_marlin` path.

Environment used for validation:

```bash
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4
```
Example:

```python
from vllm import LLM

llm = LLM(
    model="Chunity/Qwen3.6-35B-A3B-AutoRound-AWQ-4bit",
    trust_remote_code=True,
    max_model_len=256,            # small context for a lightweight check
    gpu_memory_utilization=0.95,
    max_num_seqs=1,               # single-sequence smoke test
    language_model_only=True,     # load only the language model
)
```
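A quick end-to-end check after loading, using vLLM's standard `generate` API; the prompt and sampling values here are illustrative only:

```python
from vllm import SamplingParams

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
outputs = llm.generate(["Briefly, what does AWQ quantization do?"], params)
print(outputs[0].outputs[0].text)
```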
## Validation

The checkpoint was loaded and exercised with vLLM 0.19.1.

Observed:

- loads successfully as AWQ (`awq_marlin`)
- coherent factual generation works
- this model family may still emit reasoning-style `<think>` output depending on prompt formatting and runtime settings (a post-processing sketch follows)
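If the reasoning span is unwanted downstream, it can be stripped after decoding. A minimal post-processing sketch, assuming the output uses literal `<think>...</think>` tags:

```python
import re

def strip_think(text: str) -> str:
    # Drop any <think>...</think> reasoning spans plus trailing whitespace.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

print(strip_think("<think>working...</think>Paris is the capital of France."))
# -> Paris is the capital of France.
```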
## Files

The repo includes:

- AWQ weight shards
- `config.json`
- `quantization_config.json`
- tokenizer and processor files
- `model_extra_tensors.safetensors` for preserved non-exported tensors
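To see what was kept outside the AWQ export (e.g. the MTP tensors), the extra-tensor file can be listed without loading the model. A minimal sketch using the `safetensors` package, assuming a locally downloaded copy:

```python
from safetensors import safe_open

# Enumerate preserved tensors and their shapes without materializing them.
with safe_open("model_extra_tensors.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())
```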
## Caveat
This is a mixed FP/AWQ export tailored to Qwen3.6's hybrid-attention MoE architecture. The quantization intentionally does not compress every submodule.