JoyAI-VL-Interaction-Preview-AWQ

AWQ W4A16 post-training quantization (PTQ) of jdopensource/JoyAI-VL-Interaction-Preview.

This model card focuses on how the quantized checkpoint was produced, not on how to consume the model (for inference usage please refer to the base model card).

Base Model

Attribute Value
Base model jdopensource/JoyAI-VL-Interaction-Preview
Architecture Qwen3VLForConditionalGeneration
Scale ~8B parameters
Original dtype bfloat16

Quantization Method

Attribute Value
Method AWQ — Activation-aware Weight Quantization
Tool llmcompressor
API used llmcompressor.oneshot() with AWQModifier
Quantization scheme W4A16 (4-bit weights, 16-bit activations)
Format compressed-tensors (quant_method: "compressed-tensors")
Quantization status compressed
Group size 128
Weight packing pack-quantized (INT4 packed into INT32 containers)
Duo scaling Enabled (duo_scaling=True)
AWQ grid search size n_grid: 20
Symmetric weights Yes
Smoothing mappings Auto-inferred by llmcompressor for Qwen3VLForConditionalGeneration

Modules Targeted and Excluded

  • Targeted: all Linear layers in the text model.
  • Excluded from quantization:
    • lm_head
    • All vision components: visual, vision_tower, vision_model, vision_proj, merger

The vision encoder and language model head were intentionally kept in higher precision to preserve visual understanding quality and output embedding fidelity.

The exact excluded module list recorded in config.json is:

["model.visual.blocks.0.attn.qkv", "model.visual.blocks.0.attn.proj", ...]

(fully expanded in the repository's config.json under quantization_config.ignore)

Quantization Recipe (as saved in recipe.yaml)

default_stage:
  default_modifiers:
    AWQModifier:
      mappings:
      - smooth_layer: re:.*input_layernorm$
        balance_layers: ['re:.*q_proj$', 're:.*k_proj$', 're:.*v_proj$']
      - smooth_layer: re:.*v_proj$
        balance_layers: ['re:.*o_proj$']
      - smooth_layer: re:.*post_attention_layernorm$
        balance_layers: ['re:.*gate_proj$', 're:.*up_proj$']
      - smooth_layer: re:.*up_proj$
        balance_layers: ['re:.*down_proj$']
      duo_scaling: true
      n_grid: 20
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head, 're:.*visual.*', 're:.*vision_tower.*', 're:.*vision_model.*',
               're:.*vision_proj.*', 're:.*merger.*']
      scheme: W4A16
      bypass_divisibility_checks: false

llmcompressor internally split the request into an AWQ smoothing step followed by a QuantizationModifier step.

Calibration Dataset

Because the original jdopensource/JoyAI-VL-Interaction dataset only contains annotation JSON files and not the actual video assets, calibration was performed on a publicly available video-question-answering dataset that ships with real .mp4 files.

Attribute Value
Dataset MBZUAI/VCGBench-Diverse
Videos available 877 unique .mp4 files
QA pairs available 4,354 entries in vcgbench_diverse_qa.json
Samples used 128 QA pairs
Sampling seed 42
Filtering rule Only QA entries whose referenced video file actually exists were kept

Preprocessing Pipeline

For each sampled QA pair:

  1. Open the referenced .mp4 with OpenCV.
  2. Extract the first video frame.
  3. Convert the frame to a PIL RGB image.
  4. Build a single-turn user message:
    [
      {"type": "image"},
      {"type": "text", "text": "<question text>"}
    ]
    
  5. Apply the Qwen3-VL chat template (add_generation_prompt=True).
  6. Process through the base model's AutoProcessor with max_length=2048 and truncation=True.

The calibration batch therefore consists of image + text tokenized inputs matching the model's expected multimodal format.

Hardware and Software Environment

Attribute Value
GPU NVIDIA RTX 5090 (Blackwell, SM 120, 32 GB)
OS Windows
CUDA 13.1 runtime driver; PyTorch cu128 wheel
Python 3.10
PyTorch 2.11.0+cu128
torchvision 0.26.0+cu128
Key packages transformers, llmcompressor, datasets, opencv-python, Pillow

Notes on the Runtime

  • expandable_segments is not supported by the CUDA allocator on Windows, so this option had no effect and was left as a harmless no-op.
  • The quantization completed successfully on a single RTX 5090 without model sharding.

Quantization Run Log

Key observations from the run:

Stage GPU Memory
Original BF16 model loaded ~16.33 GB
After AWQ W4A16 applied ~6.74 GB
  • AWQ executed 37 sequential subgraphs.
  • All calibration, smoothing, and scale propagation steps completed without errors.
  • Model was saved with save_compressed=True, producing the compressed-tensors checkpoint layout.

Checkpoint Contents and Size

File Size
model.safetensors ~6.8 GB
config.json ~7.9 KB
tokenizer.json ~11 MB
chat_template.jinja ~5.3 KB
recipe.yaml ~1.4 KB
generation_config.json, processor_config.json, tokenizer_config.json small metadata
Total ~6.8 GB

Format Notes

This checkpoint is not in legacy AutoAWQ format (quant_method: "awq"). It uses the compressed-tensors format produced directly by llmcompressor:

"quantization_config": {
  "quant_method": "compressed-tensors",
  "quantization_status": "compressed",
  "format": "pack-quantized",
  ...
}

Compatibility with downstream engines:

  • transformers can load it as long as compressed-tensors is installed.
  • vLLM support depends on the vLLM version understanding quant_method: "compressed-tensors" for Qwen3-VL; use a recent release and verify before deploying.

Known Limitations and Caveats

  • Post-training quantization (PTQ) always introduces some accuracy loss compared to the original BF16 checkpoint. No downstream benchmark evaluation was performed on this specific quantized checkpoint yet.
  • Deprecated API: the AWQModifier used in this run is the legacy compatibility shim in llmcompressor. Newer releases recommend replacing it with AWQTransformModifier followed by QuantizationModifier.
  • The vision encoder was excluded from quantization, so memory savings come almost entirely from the language-model weights.
  • Calibration was done on a general-domain academic video QA dataset, not on the original JoyAI-VL-Interaction videos. If you deploy this model for the exact domain the base model was tuned for, you may want to re-quantize with domain-specific calibration data.

Reproducibility

To reproduce this quantization from the base model:

  1. Download or clone MBZUAI/VCGBench-Diverse and extract the videos/ directory next to vcgbench_diverse_qa.json.
  2. Install the same environment (torch==2.11.0+cu128, torchvision==0.26.0+cu128, llmcompressor, transformers, datasets, opencv-python, Pillow).
  3. Run the quantization script with random seed 42, 128 samples, max_seq_length=2048.

License

Same license as the base model jdopensource/JoyAI-VL-Interaction-Preview. Please refer to the base model card for the exact license terms.

Downloads last month
-
Safetensors
Model size
9B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for openaiarka/JoyAI-VL-Interaction-Preview-AWQ

Quantized
(2)
this model