---
license: apache-2.0
base_model:
- inclusionAI/ZwZ-8B
datasets:
- inclusionAI/ZwZ-RL-VQA
- inclusionAI/ZoomBench
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- F8_E4M3
- fp8
- vllm
- llm-compressor
---
![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/cJvpKspuxHdZNnkURe5jC.png)
# **ZwZ-8B-FP8**
> **ZwZ-8B-FP8** is an FP8-compressed variant of **inclusionAI/ZwZ-8B**. It uses **BF16 · FP8 (F8_E4M3)** precision formats to significantly reduce memory footprint and improve inference efficiency while preserving the fine-grained multimodal perception strengths of the original architecture.
> The result is a highly efficient 8B vision-language model optimized for real-time, single-pass visual reasoning with improved hardware efficiency.
> [!important]
> FP8 (8-bit floating point) weight and activation quantization with hardware acceleration on supported GPUs – [FP8 W8A8](https://docs.vllm.ai/en/stable/features/quantization/fp8/). Quantized with the W8A8 FP8-dynamic recipe from llm-compressor – [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8).
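For reference, the quantization step looks roughly like the sketch below, adapted from the llm-compressor examples linked above. This is a minimal sketch, assuming a Qwen3-VL-style module layout; the `ignore` patterns for the vision tower and the exact `oneshot` import path vary by llm-compressor version, so verify them before running.

```python
# Minimal sketch of the W8A8 FP8-dynamic recipe (see the llm-compressor
# examples linked above). The vision-tower ignore pattern is an assumption
# for a Qwen3-VL-style model; verify module names for your checkpoint.
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "inclusionAI/ZwZ-8B"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8 weights with dynamic per-token FP8 activations; the scheme is
# data-free, so no calibration set is needed. The LM head and vision
# tower are left unquantized.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:visual.*"],  # assumed module names
)
oneshot(model=model, recipe=recipe)

SAVE_DIR = "ZwZ-8B-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```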
## About the Base Model
**ZwZ-8B** from inclusionAI is an 8B-parameter fine-grained multimodal perception vision-language model built upon Qwen3-VL-8B. It is trained using innovative **Region-to-Image Distillation (R2I)** combined with reinforcement learning to achieve state-of-the-art visual understanding in a single forward pass.
Unlike traditional VLMs that require inference-time zooming, cropping, or tool calling, ZwZ internalizes region-level perception directly into full-image reasoning.
### Key Innovations of ZwZ-8B
* **Region-to-Image Distillation (R2I)**:
  Teacher models such as Qwen3-VL-235B and GLM-4.5V generate high-fidelity VQA supervision on micro-cropped image regions with precise bounding boxes. This region-grounded supervision is distilled back into full-image context, allowing the student model to internalize fine-grained perception (a schematic sketch of this data-construction step follows at the end of this section).
* **Single-Pass Fine-Grained Understanding**:
Eliminates multi-step inference pipelines involving zooming, cropping, or external tool calls.
* **Strong Micro-Perception Capabilities**:
* OCR and small-text detection
* Object counting
* Color and material attribute recognition
* Structural analysis
* Symbol and icon detection in dense scenes
* **Out-of-Distribution Generalization**:
Demonstrates strong performance on:
* Visual reasoning benchmarks
* GUI agent tasks
* AIGC detection
* Complex real-world scenes
* **Edge-Optimized Deployment**:
Enables real-time robotics and mobile vision applications without multi-stage inference overhead.
ZwZ is part of a broader model family spanning 4B, 7B, and 8B scales.
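To make the R2I idea concrete, here is a schematic sketch of the data-construction step described above: a teacher answers questions on micro-crops, and each QA pair is re-attached to the full image plus its bounding box. The `Teacher` interface and helper names are illustrative assumptions; the actual ZwZ training pipeline is not published in this card.

```python
# Schematic sketch of R2I data construction. Everything below is hypothetical
# scaffolding for illustration; only the overall flow (teacher VQA on crops,
# supervision re-grounded onto the full image) comes from the description above.
from dataclasses import dataclass
from typing import Protocol
from PIL import Image

class Teacher(Protocol):
    def generate_vqa(self, crop: Image.Image) -> tuple[str, str]:
        """Return a (question, answer) pair for a micro-cropped region
        (played by e.g. Qwen3-VL-235B or GLM-4.5V in ZwZ's setup)."""
        ...

@dataclass
class R2ISample:
    image_path: str                   # the FULL image, not the crop
    box: tuple[int, int, int, int]    # (x0, y0, x1, y1) in full-image pixels
    question: str
    answer: str

def build_r2i_samples(image_path: str, boxes, teacher: Teacher) -> list[R2ISample]:
    full = Image.open(image_path)
    samples = []
    for box in boxes:
        crop = full.crop(box)
        question, answer = teacher.generate_vqa(crop)
        # Key step: the region-grounded QA pair is attached to the full image
        # plus its bounding box, so the student internalizes fine-grained
        # perception without inference-time zooming or cropping.
        samples.append(R2ISample(image_path, box, question, answer))
    return samples
```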
## What FP8 Adds
The **ZwZ-8B-FP8** variant introduces:
* **BF16 · FP8 (F8_E4M3) Compression**: llm-compressor–based FP8 quantization reduces VRAM usage while maintaining strong perception fidelity.
* **Higher Throughput**: Improved tokens per second and image processing speed.
* **Lower Memory Footprint**: Better deployment feasibility on Hopper-class and compatible GPUs.
* **Production-Friendly Efficiency**: Ideal for real-time multimodal systems requiring compact yet powerful perception models.
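Given the vLLM tags above, a minimal offline-inference sketch is shown below (the server equivalent is `vllm serve prithivMLmods/ZwZ-8B-FP8`). Treat it as a sketch: multimodal chat support for Qwen3-VL-style models and the exact message format depend on your vLLM version.

```python
# Hedged sketch: running the FP8 checkpoint with vLLM's offline LLM API.
# Full FP8 acceleration needs a GPU with native FP8 support (e.g. Hopper-class);
# Qwen3-VL multimodal support depends on your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(model="prithivMLmods/ZwZ-8B-FP8", max_model_len=8192)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            },
            {"type": "text", "text": "Analyze the fine-grained details in this image."},
        ],
    }
]

# LLM.chat applies the model's chat template and fetches the image URL
outputs = llm.chat(messages, SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```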
## Quick Start with Transformers
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the FP8-compressed ZwZ-8B model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/ZwZ-8B-FP8",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("prithivMLmods/ZwZ-8B-FP8")

# Build a multimodal chat message with one image and a text instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Analyze the fine-grained details in this image."},
        ],
    }
]

# Render the chat template and collect the vision inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# Generate, then strip the prompt tokens from each output sequence
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output_text)
```
## Intended Use
* Real-time multimodal perception systems
* Robotics and embodied AI
* GUI agents
* OCR-heavy and structured visual environments
* Edge deployment scenarios requiring single-pass inference
## Limitations & Risks
* FP8 requires compatible GPU architectures for optimal acceleration.
* While compression maintains strong fidelity, extremely fine-grained edge cases may show minor precision differences compared to full BF16.
* Users are responsible for ethical and lawful deployment.