- JoyAI-VL-Interaction-Preview-AWQ
JoyAI-VL-Interaction-Preview-AWQ
AWQ W4A16 post-training quantization (PTQ) of jdopensource/JoyAI-VL-Interaction-Preview.
This model card focuses on how the quantized checkpoint was produced, not on how to consume the model (for inference usage please refer to the base model card).
Base Model
| Attribute | Value |
|---|---|
| Base model | jdopensource/JoyAI-VL-Interaction-Preview |
| Architecture | Qwen3VLForConditionalGeneration |
| Scale | ~8B parameters |
| Original dtype | bfloat16 |
Quantization Method
| Attribute | Value |
|---|---|
| Method | AWQ — Activation-aware Weight Quantization |
| Tool | llmcompressor |
| API used | llmcompressor.oneshot() with AWQModifier |
| Quantization scheme | W4A16 (4-bit weights, 16-bit activations) |
| Format | compressed-tensors (quant_method: "compressed-tensors") |
| Quantization status | compressed |
| Group size | 128 |
| Weight packing | pack-quantized (INT4 packed into INT32 containers) |
| Duo scaling | Enabled (duo_scaling=True) |
| AWQ grid search size | n_grid: 20 |
| Symmetric weights | Yes |
| Smoothing mappings | Auto-inferred by llmcompressor for Qwen3VLForConditionalGeneration |
Modules Targeted and Excluded
- Targeted: all
Linearlayers in the text model. - Excluded from quantization:
lm_head- All vision components:
visual,vision_tower,vision_model,vision_proj,merger
The vision encoder and language model head were intentionally kept in higher precision to preserve visual understanding quality and output embedding fidelity.
The exact excluded module list recorded in config.json is:
["model.visual.blocks.0.attn.qkv", "model.visual.blocks.0.attn.proj", ...]
(fully expanded in the repository's config.json under quantization_config.ignore)
Quantization Recipe (as saved in recipe.yaml)
default_stage:
default_modifiers:
AWQModifier:
mappings:
- smooth_layer: re:.*input_layernorm$
balance_layers: ['re:.*q_proj$', 're:.*k_proj$', 're:.*v_proj$']
- smooth_layer: re:.*v_proj$
balance_layers: ['re:.*o_proj$']
- smooth_layer: re:.*post_attention_layernorm$
balance_layers: ['re:.*gate_proj$', 're:.*up_proj$']
- smooth_layer: re:.*up_proj$
balance_layers: ['re:.*down_proj$']
duo_scaling: true
n_grid: 20
QuantizationModifier:
targets: [Linear]
ignore: [lm_head, 're:.*visual.*', 're:.*vision_tower.*', 're:.*vision_model.*',
're:.*vision_proj.*', 're:.*merger.*']
scheme: W4A16
bypass_divisibility_checks: false
llmcompressor internally split the request into an AWQ smoothing step followed by a QuantizationModifier step.
Calibration Dataset
Because the original jdopensource/JoyAI-VL-Interaction dataset only contains annotation JSON files and not the actual video assets, calibration was performed on a publicly available video-question-answering dataset that ships with real .mp4 files.
| Attribute | Value |
|---|---|
| Dataset | MBZUAI/VCGBench-Diverse |
| Videos available | 877 unique .mp4 files |
| QA pairs available | 4,354 entries in vcgbench_diverse_qa.json |
| Samples used | 128 QA pairs |
| Sampling seed | 42 |
| Filtering rule | Only QA entries whose referenced video file actually exists were kept |
Preprocessing Pipeline
For each sampled QA pair:
- Open the referenced
.mp4with OpenCV. - Extract the first video frame.
- Convert the frame to a PIL
RGBimage. - Build a single-turn user message:
[ {"type": "image"}, {"type": "text", "text": "<question text>"} ] - Apply the Qwen3-VL chat template (
add_generation_prompt=True). - Process through the base model's
AutoProcessorwithmax_length=2048andtruncation=True.
The calibration batch therefore consists of image + text tokenized inputs matching the model's expected multimodal format.
Hardware and Software Environment
| Attribute | Value |
|---|---|
| GPU | NVIDIA RTX 5090 (Blackwell, SM 120, 32 GB) |
| OS | Windows |
| CUDA | 13.1 runtime driver; PyTorch cu128 wheel |
| Python | 3.10 |
| PyTorch | 2.11.0+cu128 |
| torchvision | 0.26.0+cu128 |
| Key packages | transformers, llmcompressor, datasets, opencv-python, Pillow |
Notes on the Runtime
expandable_segmentsis not supported by the CUDA allocator on Windows, so this option had no effect and was left as a harmless no-op.- The quantization completed successfully on a single RTX 5090 without model sharding.
Quantization Run Log
Key observations from the run:
| Stage | GPU Memory |
|---|---|
| Original BF16 model loaded | ~16.33 GB |
| After AWQ W4A16 applied | ~6.74 GB |
- AWQ executed 37 sequential subgraphs.
- All calibration, smoothing, and scale propagation steps completed without errors.
- Model was saved with
save_compressed=True, producing thecompressed-tensorscheckpoint layout.
Checkpoint Contents and Size
| File | Size |
|---|---|
model.safetensors |
~6.8 GB |
config.json |
~7.9 KB |
tokenizer.json |
~11 MB |
chat_template.jinja |
~5.3 KB |
recipe.yaml |
~1.4 KB |
generation_config.json, processor_config.json, tokenizer_config.json |
small metadata |
| Total | ~6.8 GB |
Format Notes
This checkpoint is not in legacy AutoAWQ format (quant_method: "awq"). It uses the compressed-tensors format produced directly by llmcompressor:
"quantization_config": {
"quant_method": "compressed-tensors",
"quantization_status": "compressed",
"format": "pack-quantized",
...
}
Compatibility with downstream engines:
transformerscan load it as long ascompressed-tensorsis installed.vLLMsupport depends on the vLLM version understandingquant_method: "compressed-tensors"for Qwen3-VL; use a recent release and verify before deploying.
Known Limitations and Caveats
- Post-training quantization (PTQ) always introduces some accuracy loss compared to the original BF16 checkpoint. No downstream benchmark evaluation was performed on this specific quantized checkpoint yet.
- Deprecated API: the
AWQModifierused in this run is the legacy compatibility shim inllmcompressor. Newer releases recommend replacing it withAWQTransformModifierfollowed byQuantizationModifier. - The vision encoder was excluded from quantization, so memory savings come almost entirely from the language-model weights.
- Calibration was done on a general-domain academic video QA dataset, not on the original JoyAI-VL-Interaction videos. If you deploy this model for the exact domain the base model was tuned for, you may want to re-quantize with domain-specific calibration data.
Reproducibility
To reproduce this quantization from the base model:
- Download or clone
MBZUAI/VCGBench-Diverseand extract thevideos/directory next tovcgbench_diverse_qa.json. - Install the same environment (
torch==2.11.0+cu128,torchvision==0.26.0+cu128,llmcompressor,transformers,datasets,opencv-python,Pillow). - Run the quantization script with random seed 42, 128 samples,
max_seq_length=2048.
License
Same license as the base model jdopensource/JoyAI-VL-Interaction-Preview. Please refer to the base model card for the exact license terms.
- Downloads last month
- -
Model tree for openaiarka/JoyAI-VL-Interaction-Preview-AWQ
Base model
jdopensource/JoyAI-VL-Interaction-Preview