| | --- |
| | tags: |
| | - fp4 |
| | - vllm |
| | language: |
| | - en |
| | - de |
| | - fr |
| | - it |
| | - pt |
| | - hi |
| | - es |
| | - th |
| | pipeline_tag: text-generation |
| | license: apache-2.0 |
| | base_model: Qwen/Qwen3-VL-235B-A22B-Instruct |
| | --- |
| | |
| | # Qwen3-VL-235B-A22B-Instruct-NVFP4 |
| |
|
| | ## Model Overview |
| | - **Model Architecture:** Qwen/Qwen3-VL-235B-A22B-Instruct |
| | - **Input:** Text, Image
| | - **Output:** Text |
| | - **Model Optimizations:** |
| | - **Weight quantization:** FP4 |
| | - **Activation quantization:** FP4 |
| | - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. |
| | - **Release Date:** 10/29/2025 |
| | - **Version:** 1.0 |
| | - **Model Developers:** RedHatAI |
| |
|
| | This model is a quantized version of [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). |
| | It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
| |
|
| | ### Model Optimizations |
| |
|
| | This model was obtained by quantizing the weights and activations of [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct) to the FP4 data type, ready for inference with vLLM >= 0.9.1.
| | This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. |
| |
|
| | Only the weights and activations of the linear operators within transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
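
The 16-to-4-bit reduction quoted above can be checked with a little arithmetic. The sketch below is a rough estimate, not a measurement: it assumes the nominal 235e9 parameter count from the model name, treats every parameter as quantized, and assumes that NVFP4 stores one 8-bit scale per block of 16 four-bit values. Layers excluded from quantization (such as `lm_head`, the vision tower, and the MoE router gates listed in the recipe further down) stay in 16-bit, so the real saving is somewhat smaller.

```python
# Rough storage estimate for the model weights.
# Assumptions (not stated in this model card):
#   * 235e9 parameters, all of them quantized
#   * NVFP4 keeps one 8-bit scale for every block of 16 four-bit values


def bits_per_weight(value_bits=4, scale_bits=8, block_size=16):
    """Effective bits per weight once the per-block scale is amortized."""
    return value_bits + scale_bits / block_size


def size_gb(num_params, bits):
    """Storage in gigabytes (1 GB = 1e9 bytes) for num_params weights."""
    return num_params * bits / 8 / 1e9


NUM_PARAMS = 235e9  # nominal count taken from the model name

bf16_gb = size_gb(NUM_PARAMS, 16)
ideal_fp4_gb = size_gb(NUM_PARAMS, 4)
nvfp4_gb = size_gb(NUM_PARAMS, bits_per_weight())

for label, gb in [("16-bit", bf16_gb), ("ideal 4-bit", ideal_fp4_gb), ("4-bit + scales", nvfp4_gb)]:
    reduction = 100.0 * (1.0 - gb / bf16_gb)
    print(f"{label:15s} {gb:8.1f} GB  ({reduction:.1f}% smaller than 16-bit)")
```

Under these assumptions the ideal 4-bit case gives exactly the 75% figure, while accounting for block scales gives a slightly lower number.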
| |
|
| | ## Deployment |
| |
|
| | ### Use with vLLM |
| |
|
| | This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below. |
| |
|
| | ```python |
| | from vllm import LLM, SamplingParams |
| | from transformers import AutoTokenizer |
| | |
| | model_id = "RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4" |
| | number_gpus = 1  # increase to shard the model across GPUs with tensor parallelism
| | |
| | sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256) |
| | |
| | tokenizer = AutoTokenizer.from_pretrained(model_id) |
| | |
| | messages = [
| |     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
| |     {"role": "user", "content": "Who are you?"},
| | ]
| | |
| | prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) |
| | |
| | llm = LLM(model=model_id, tensor_parallel_size=number_gpus) |
| | |
| | outputs = llm.generate(prompts, sampling_params) |
| | |
| | generated_text = outputs[0].outputs[0].text |
| | print(generated_text) |
| | ``` |
| |
|
| | vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
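
As a minimal sketch of the OpenAI-compatible route, the snippet below builds a chat-completions request and, optionally, sends it with the standard library. It assumes a server was started separately (for example with `vllm serve RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 --tensor-parallel-size 2`) and is listening on `http://localhost:8000`. The helper names `build_chat_request` and `send_chat_request` and the image URL are illustrative placeholders, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4"
BASE_URL = "http://localhost:8000/v1"  # assumption: local server on the default port


def build_chat_request(prompt, image_url=None, max_tokens=256):
    """Build an OpenAI-style chat-completions payload (hypothetical helper)."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        # Images are passed as an extra content part in the OpenAI message format.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "top_p": 0.9,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request(
    "Describe this image in one sentence.",
    image_url="https://example.com/image.jpg",  # placeholder URL
)
print(json.dumps(payload, indent=2))

# Uncomment once a server is running:
# print(send_chat_request(payload))
```

Building the payload separately from sending it makes it easy to inspect the request before a server is available.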
| |
|
| | ## Creation |
| |
|
| | This model was created by applying [LLM Compressor](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/llama3_example.py) with calibration samples from the `neuralmagic/calibration` dataset, as presented in the code snippet below.
| |
|
| | <details> |
| | |
| | ```python |
| | import torch |
| | from datasets import load_dataset |
| | from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration |
| | |
| | from llmcompressor import oneshot |
| | from llmcompressor.modeling import replace_modules_for_calibration |
| | from llmcompressor.modifiers.quantization import QuantizationModifier |
| | from llmcompressor.utils import dispatch_for_generation |
| | |
| | # NOTE: Requires a minimum of transformers 4.57.0 |
| | |
| | MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct" |
| | |
| | |
| | # Load model. |
| | model = Qwen3VLMoeForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto") |
| | processor = AutoProcessor.from_pretrained(MODEL_ID) |
| | model = replace_modules_for_calibration(model) |
| | |
| | DATASET_ID = "neuralmagic/calibration" |
| | NUM_CALIBRATION_SAMPLES = 20 |
| | MAX_SEQUENCE_LENGTH = 8192 |
| | |
| | ds = load_dataset(DATASET_ID, name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]") |
| | |
| | |
| | def preprocess_function(example):
| |     messages = []
| |     for message in example["messages"]:
| |         messages.append(
| |             {
| |                 "role": message["role"],
| |                 "content": [{"type": "text", "text": message["content"]}],
| |             }
| |         )
| | 
| |     return processor.apply_chat_template(
| |         messages,
| |         return_tensors="pt",
| |         padding=False,
| |         truncation=True,
| |         max_length=MAX_SEQUENCE_LENGTH,
| |         tokenize=True,
| |         add_special_tokens=False,
| |         return_dict=True,
| |         add_generation_prompt=False,
| |     )
| | |
| | |
| | ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names) |
| | |
| | |
| | def data_collator(batch):
| |     assert len(batch) == 1
| |     return {
| |         key: (
| |             torch.tensor(value)
| |             if key != "pixel_values"
| |             else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
| |         )
| |         for key, value in batch[0].items()
| |     }
| | |
| | |
| | # Configure the quantization algorithm and scheme. |
| | # In this case, we: |
| | # * quantize the weights to fp4 with group-wise quantization |
| | # * quantize the activations to fp4 with dynamic group activations |
| | recipe = QuantizationModifier(
| |     targets="Linear",
| |     scheme="NVFP4",
| |     ignore=[
| |         "re:.*lm_head",
| |         "re:visual.*",
| |         "re:model.visual.*",
| |         "re:.*mlp.gate$",
| |     ],
| | )
| | |
| | # Apply quantization. |
| | oneshot(
| |     model=model,
| |     recipe=recipe,
| |     max_seq_length=MAX_SEQUENCE_LENGTH,
| |     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
| |     dataset=ds,
| |     data_collator=data_collator,
| | )
| | |
| | print("========== SAMPLE GENERATION ==============") |
| | dispatch_for_generation(model) |
| | input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda") |
| | output = model.generate(input_ids, max_new_tokens=20) |
| | print(processor.decode(output[0])) |
| | print("==========================================") |
| | |
| | |
| | # Save to disk in compressed-tensors format. |
| | SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4" |
| | model.save_pretrained(SAVE_DIR) |
| | processor.save_pretrained(SAVE_DIR) |
| | |
| | ``` |
| | </details> |
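
The `ignore` list in the recipe decides which `Linear` modules keep their original precision. The sketch below illustrates how such patterns could select modules. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name (as with `re.match`) and that other entries are exact names. The module names are hypothetical examples, not read from the actual checkpoint, and `is_ignored` is an illustrative helper rather than LLM Compressor's own matching code.

```python
import re

# Patterns copied from the recipe above.
IGNORE = ["re:.*lm_head", "re:visual.*", "re:model.visual.*", "re:.*mlp.gate$"]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if module_name matches any ignore entry (assumed semantics)."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            # Assumption: the regex is anchored at the start of the name.
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
EXAMPLE_MODULES = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.experts.0.gate_proj",
    "model.language_model.layers.0.self_attn.q_proj",
]

for name in EXAMPLE_MODULES:
    action = "skip" if is_ignored(name) else "quantize"
    print(f"{action:8s} {name}")
```

Note the `$` in the last pattern: it skips the MoE router (`mlp.gate`) while still allowing expert projections such as `gate_proj` to be quantized.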
| |
|
| | ## Evaluation |
| |
|
| | This model was evaluated on the OpenLLM v1 tasks using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness) and on the vision tasks `mmmu_val` and `chartqa` using `lmms_eval`.
| | |
| | ### Accuracy |
| | <table> |
| | <thead> |
| | <tr> |
| | <th>Category</th> |
| | <th>Metric</th> |
| | <th>Qwen/Qwen3-VL-235B-A22B-Instruct</th> |
| | <th>RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 (this model)</th> |
| | <th>Recovery (%)</th>
| | </tr> |
| | </thead> |
| | <tbody> |
| | <!-- OpenLLM --> |
| | <tr> |
| | <td rowspan="7"><b>OpenLLM</b></td> |
| | <td>arc_challenge</td> |
| | <td>72.95</td> |
| | <td>71.59</td> |
| | <td>98.13</td> |
| | </tr> |
| | <tr> |
| | <td>gsm8k</td> |
| | <td>90.37</td> |
| | <td>88.25</td> |
| | <td>97.65</td> |
| | </tr> |
| | <tr> |
| | <td>hellaswag</td> |
| | <td>87.94</td> |
| | <td>86.80</td> |
| | <td>98.70</td> |
| | </tr> |
| | <tr> |
| | <td>mmlu</td> |
| | <td>87.12</td> |
| | <td>86.22</td> |
| | <td>98.97</td> |
| | </tr> |
| | <tr> |
| | <td>truthfulqa_mc2</td> |
| | <td>63.31</td> |
| | <td>62.37</td> |
| | <td>98.52</td> |
| | </tr> |
| | <tr> |
| | <td>winogrande</td> |
| | <td>81.93</td> |
| | <td>80.43</td> |
| | <td>98.17</td> |
| | </tr> |
| | <tr> |
| | <td><b>Average</b></td> |
| | <td><b>80.60</b></td> |
| | <td><b>79.28</b></td> |
| | <td><b>98.35</b></td> |
| | </tr> |
| | <!-- Vision --> |
| | <tr> |
| | <td rowspan="3"><b>Vision</b></td>
| | <td>mmmu_val</td> |
| | <td>63.56</td> |
| | <td>62.11</td> |
| | <td>97.71</td> |
| | </tr> |
| | <tr> |
| | <td>chartqa</td> |
| | <td>90.52</td> |
| | <td>89.00</td> |
| | <td>98.32</td> |
| | </tr> |
| | <tr> |
| | <td><b>Average</b></td> |
| | <td><b>77.04</b></td> |
| | <td><b>75.56</b></td> |
| | <td><b>98.08</b></td> |
| | </tr> |
| | </tbody> |
| | </table> |
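
The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. A minimal sketch of that calculation, using two rows from the table, is shown below. The formula is an assumption inferred from the reported numbers, and some rows may differ in the last digit if the original values were computed from unrounded scores.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    return 100.0 * quantized / baseline


# (baseline, quantized) pairs copied from the table above.
scores = {
    "gsm8k": (90.37, 88.25),
    "chartqa": (90.52, 89.00),
}

for task, (baseline, quantized) in scores.items():
    print(f"{task}: {recovery(baseline, quantized):.2f}%")
```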
| | |
| |
|
| |
|
| | ### Reproduction |
| |
|
| | The results were obtained using the following commands: |
| |
|
| | <details> |
| |
|
| | #### OpenLLM |
| | ``` |
| | lm_eval \ |
| | --model vllm \ |
| | --model_args pretrained="RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
| | --apply_chat_template \ |
| | --fewshot_as_multiturn \ |
| | --tasks openllm \ |
| | --batch_size auto |
| | ``` |
| |
|
| | #### Vision |
| | ``` |
| | python3 -m lmms_eval \ |
| | --model vllm \ |
| | --model_args model=RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4,tensor_parallel_size=4,max_model_len=20000 \ |
| | --tasks chartqa,mmmu_val \ |
| | --batch_size 1 |
| | ``` |
| |
|
| | </details> |