---
license: apache-2.0
base_model:
- inclusionAI/ZwZ-8B
datasets:
- inclusionAI/ZwZ-RL-VQA
- inclusionAI/ZoomBench
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- F8_E4M3
- fp8
- vllm
- llm-compressor
---
# **ZwZ-8B-FP8**

> **ZwZ-8B-FP8** is an FP8-compressed variant of **inclusionAI/ZwZ-8B**. It uses **BF16 · FP8 (F8_E4M3)** precision to significantly reduce the memory footprint and improve inference efficiency while preserving the fine-grained multimodal perception strengths of the original architecture.
> The result is a highly efficient 8B vision-language model optimized for real-time, single-pass visual reasoning with improved hardware efficiency.

> [!important]
> FP8 (8-bit floating point) weight and activation quantization with hardware acceleration on GPUs – [FP8 W8A8](https://docs.vllm.ai/en/stable/features/quantization/fp8/). Quantized using the W8A8 FP8-dynamic recipe – [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8).
## About the Base Model

**ZwZ-8B** from inclusionAI is an 8B-parameter vision-language model for fine-grained multimodal perception, built upon Qwen3-VL-8B. It is trained with **Region-to-Image Distillation (R2I)** combined with reinforcement learning to achieve state-of-the-art visual understanding in a single forward pass.

Unlike traditional VLMs that rely on inference-time zooming, cropping, or tool calling, ZwZ internalizes region-level perception directly into full-image reasoning.
### Key Innovations of ZwZ-8B

* **Region-to-Image Distillation (R2I)**:
  Teacher models such as Qwen3-VL-235B and GLM-4.5V generate high-fidelity VQA supervision on micro-cropped image regions with precise bounding boxes. This region-grounded supervision is distilled back into full-image context, allowing the student model to internalize fine-grained perception.

* **Single-Pass Fine-Grained Understanding**:
  Eliminates multi-step inference pipelines involving zooming, cropping, or external tool calls.

* **Strong Micro-Perception Capabilities**:

  * OCR and small-text detection
  * Object counting
  * Color and material attribute recognition
  * Structural analysis
  * Symbol and icon detection in dense scenes

* **Out-of-Distribution Generalization**:
  Demonstrates strong performance on:

  * Visual reasoning benchmarks
  * GUI agent tasks
  * AIGC detection
  * Complex real-world scenes

* **Edge-Optimized Deployment**:
  Enables real-time robotics and mobile vision applications without multi-stage inference overhead.

ZwZ is part of a broader model family spanning 4B, 7B, and 8B scales.
## What FP8 Adds

The **ZwZ-8B-FP8** variant introduces:

* **BF16 · FP8 (F8_E4M3) Compression**: FP8 weight and activation quantization reduces VRAM usage while maintaining strong perception fidelity.
* **Higher Throughput**: Improved tokens per second and image-processing speed.
* **Lower Memory Footprint**: Better deployment feasibility on Hopper-class and compatible GPUs.
* **Production-Friendly Efficiency**: Ideal for real-time multimodal systems that need compact yet powerful perception models.
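The memory claim can be sanity-checked with back-of-envelope arithmetic (weights only; real VRAM use also includes the KV cache, activations, and any layers kept in higher precision):

```python
PARAMS = 8e9  # ~8 billion parameters (approximate)

bf16_gib = PARAMS * 2 / 1024**3  # BF16 stores 2 bytes per weight
fp8_gib = PARAMS * 1 / 1024**3   # FP8 (E4M3) stores 1 byte per weight

print(f"BF16 weights ~ {bf16_gib:.1f} GiB, FP8 weights ~ {fp8_gib:.1f} GiB")
```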
## Quick Start with Transformers

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the FP8-compressed ZwZ-8B model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/ZwZ-8B-FP8",
    torch_dtype="auto",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("prithivMLmods/ZwZ-8B-FP8")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Analyze the fine-grained details in this image."},
        ],
    }
]

# Build the chat-formatted prompt and collect the image/video inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

print(output_text)
```
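The `generated_ids_trimmed` step above removes the echoed prompt tokens from each sequence before decoding; in isolation, the slice works like this (toy token IDs, no model required):

```python
# Toy stand-ins for inputs.input_ids and the output of model.generate()
input_ids = [[101, 7, 8, 9]]              # prompt tokens fed to the model
generated_ids = [[101, 7, 8, 9, 42, 43]]  # generate() returns prompt + new tokens

# Same slicing expression as in the snippet above
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # → [[42, 43]]
```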
## Intended Use

* Real-time multimodal perception systems
* Robotics and embodied AI
* GUI agents
* OCR-heavy and structured visual environments
* Edge deployment scenarios requiring single-pass inference
## Limitations & Risks

* FP8 requires compatible GPU architectures (e.g., Hopper-class) for optimal acceleration.
* While compression maintains strong fidelity, extremely fine-grained edge cases may show minor precision differences compared to the full-precision BF16 model.
* Users are responsible for ethical and lawful deployment.