JoyAI-VL-Interaction-Preview-AWQ

AWQ W4A16 post-training quantization (PTQ) of jdopensource/JoyAI-VL-Interaction-Preview.

This model card focuses on how the quantized checkpoint was produced, not on how to consume the model (for inference usage please refer to the base model card).

Base Model

Attribute	Value
Base model	`jdopensource/JoyAI-VL-Interaction-Preview`
Architecture	`Qwen3VLForConditionalGeneration`
Scale	~8B parameters
Original dtype	`bfloat16`

Quantization Method

Attribute	Value
Method	AWQ — Activation-aware Weight Quantization
Tool	`llmcompressor`
API used	`llmcompressor.oneshot()` with `AWQModifier`
Quantization scheme	W4A16 (4-bit weights, 16-bit activations)
Format	`compressed-tensors` (`quant_method: "compressed-tensors"`)
Quantization status	`compressed`
Group size	128
Weight packing	`pack-quantized` (INT4 packed into INT32 containers)
Duo scaling	Enabled (`duo_scaling=True`)
AWQ grid search size	`n_grid: 20`
Symmetric weights	Yes
Smoothing mappings	Auto-inferred by `llmcompressor` for `Qwen3VLForConditionalGeneration`

Modules Targeted and Excluded

Targeted: all Linear layers in the text model.
Excluded from quantization:
- lm_head
- All vision components: visual, vision_tower, vision_model, vision_proj, merger

The vision encoder and language model head were intentionally kept in higher precision to preserve visual understanding quality and output embedding fidelity.

The exact excluded module list recorded in config.json is:

["model.visual.blocks.0.attn.qkv", "model.visual.blocks.0.attn.proj", ...]

(fully expanded in the repository's config.json under quantization_config.ignore)

Quantization Recipe (as saved in `recipe.yaml`)

default_stage:
  default_modifiers:
    AWQModifier:
      mappings:
      - smooth_layer: re:.*input_layernorm$
        balance_layers: ['re:.*q_proj$', 're:.*k_proj$', 're:.*v_proj$']
      - smooth_layer: re:.*v_proj$
        balance_layers: ['re:.*o_proj$']
      - smooth_layer: re:.*post_attention_layernorm$
        balance_layers: ['re:.*gate_proj$', 're:.*up_proj$']
      - smooth_layer: re:.*up_proj$
        balance_layers: ['re:.*down_proj$']
      duo_scaling: true
      n_grid: 20
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head, 're:.*visual.*', 're:.*vision_tower.*', 're:.*vision_model.*',
               're:.*vision_proj.*', 're:.*merger.*']
      scheme: W4A16
      bypass_divisibility_checks: false

llmcompressor internally split the request into an AWQ smoothing step followed by a QuantizationModifier step.

Calibration Dataset

Because the original jdopensource/JoyAI-VL-Interaction dataset only contains annotation JSON files and not the actual video assets, calibration was performed on a publicly available video-question-answering dataset that ships with real .mp4 files.

Attribute	Value
Dataset	`MBZUAI/VCGBench-Diverse`
Videos available	877 unique `.mp4` files
QA pairs available	4,354 entries in `vcgbench_diverse_qa.json`
Samples used	128 QA pairs
Sampling seed	42
Filtering rule	Only QA entries whose referenced video file actually exists were kept

Preprocessing Pipeline

For each sampled QA pair:

Open the referenced .mp4 with OpenCV.
Extract the first video frame.
Convert the frame to a PIL RGB image.

Build a single-turn user message:

[
  {"type": "image"},
  {"type": "text", "text": "<question text>"}
]

Apply the Qwen3-VL chat template (add_generation_prompt=True).
Process through the base model's AutoProcessor with max_length=2048 and truncation=True.

The calibration batch therefore consists of image + text tokenized inputs matching the model's expected multimodal format.

Hardware and Software Environment

Attribute	Value
GPU	NVIDIA RTX 5090 (Blackwell, SM 120, 32 GB)
OS	Windows
CUDA	13.1 runtime driver; PyTorch `cu128` wheel
Python	3.10
PyTorch	`2.11.0+cu128`
torchvision	`0.26.0+cu128`
Key packages	`transformers`, `llmcompressor`, `datasets`, `opencv-python`, `Pillow`

Notes on the Runtime

expandable_segments is not supported by the CUDA allocator on Windows, so this option had no effect and was left as a harmless no-op.
The quantization completed successfully on a single RTX 5090 without model sharding.

Quantization Run Log

Key observations from the run:

Stage	GPU Memory
Original BF16 model loaded	~16.33 GB
After AWQ W4A16 applied	~6.74 GB

AWQ executed 37 sequential subgraphs.
All calibration, smoothing, and scale propagation steps completed without errors.
Model was saved with save_compressed=True, producing the compressed-tensors checkpoint layout.

Checkpoint Contents and Size

File	Size
`model.safetensors`	~6.8 GB
`config.json`	~7.9 KB
`tokenizer.json`	~11 MB
`chat_template.jinja`	~5.3 KB
`recipe.yaml`	~1.4 KB
`generation_config.json`, `processor_config.json`, `tokenizer_config.json`	small metadata
Total	~6.8 GB

Format Notes

This checkpoint is not in legacy AutoAWQ format (quant_method: "awq"). It uses the compressed-tensors format produced directly by llmcompressor:

"quantization_config": {
  "quant_method": "compressed-tensors",
  "quantization_status": "compressed",
  "format": "pack-quantized",
  ...
}

Compatibility with downstream engines:

transformers can load it as long as compressed-tensors is installed.
vLLM support depends on the vLLM version understanding quant_method: "compressed-tensors" for Qwen3-VL; use a recent release and verify before deploying.

Known Limitations and Caveats

Post-training quantization (PTQ) always introduces some accuracy loss compared to the original BF16 checkpoint. No downstream benchmark evaluation was performed on this specific quantized checkpoint yet.
Deprecated API: the AWQModifier used in this run is the legacy compatibility shim in llmcompressor. Newer releases recommend replacing it with AWQTransformModifier followed by QuantizationModifier.
The vision encoder was excluded from quantization, so memory savings come almost entirely from the language-model weights.
Calibration was done on a general-domain academic video QA dataset, not on the original JoyAI-VL-Interaction videos. If you deploy this model for the exact domain the base model was tuned for, you may want to re-quantize with domain-specific calibration data.

Reproducibility

To reproduce this quantization from the base model:

Download or clone MBZUAI/VCGBench-Diverse and extract the videos/ directory next to vcgbench_diverse_qa.json.
Install the same environment (torch==2.11.0+cu128, torchvision==0.26.0+cu128, llmcompressor, transformers, datasets, opencv-python, Pillow).
Run the quantization script with random seed 42, 128 samples, max_seq_length=2048.

License

Same license as the base model jdopensource/JoyAI-VL-Interaction-Preview. Please refer to the base model card for the exact license terms.

Downloads last month: -

Safetensors

Model size

9B params

Tensor type

I64

I32

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for openaiarka/JoyAI-VL-Interaction-Preview-AWQ

Base model

jdopensource/JoyAI-VL-Interaction-Preview

Quantized

(2)

this model