nullrunner committed · Commit 467eeae · verified · 1 Parent(s): 3706e6e

Update README with quantization details

Files changed (1):
  1. README.md +60 -41
README.md CHANGED
@@ -2,76 +2,95 @@
  license: apache-2.0
  base_model: Qwen/Qwen3-VL-32B-Instruct
  tags:
- - exl3
  - exllamav3
  - quantized
  - vision
  - multimodal
- - qwen3
  library_name: exllamav3
  ---

- # Qwen3-VL-32B-Instruct EXL3 4.0bpw

- ExLlamaV3 quantization of [Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct) at 4.0 bits per weight.

- ## Quantization Specifications

  | Parameter | Value |
  |-----------|-------|
- | **Format** | EXL3 (ExLlamaV3) |
- | **Bits per Weight** | 4.0 |
- | **Head Bits** | 6 |
  | **Calibration Rows** | 128 |
- | **Calibration Context** | 4096 |
- | **Codebook** | MCG |
- | **Output Scales** | Auto |
- | **ExLlamaV3 Version** | 0.0.16 |
-
- ## Model Size
-
- | File | Size |
- |------|------|
- | model-00001-of-00003.safetensors | 7.9 GB |
- | model-00002-of-00003.safetensors | 8.0 GB |
- | model-00003-of-00003.safetensors | 1.9 GB |
- | **Total** | **~18 GB** |

- ## Quality Metrics

- - **Final SQNR**: 40.95 dB (excellent)
- - **Cosine Similarity Error**: 0.000053

  ## Hardware Requirements

- - **Minimum VRAM**: ~20 GB (tight fit on RTX 4090 24GB)
- - **Recommended**: RTX 4090, RTX 3090, A100, or better

- ## Usage

- This model requires [ExLlamaV3](https://github.com/turboderp/exllamav3) or compatible inference engines.

- ### With TabbyAPI

- ## Vision Capabilities

- This model supports multimodal input (text + images). Use OpenAI-compatible vision API format:

- ## Original Model
-
- - **Base**: [Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct)
- - **Architecture**: Qwen3VLForConditionalGeneration
- - **Context Length**: Up to 128K tokens
- - **Vocab Size**: 151,669

- ## Quantization Details

- Quantized on NVIDIA A100 80GB using ExLlamaV3 convert.py with standard calibration data (c4, wiki, code).

- ---

- *Quantized by [nullrunner](https://huggingface.co/nullrunner) - November 2025*
 
  license: apache-2.0
  base_model: Qwen/Qwen3-VL-32B-Instruct
  tags:
  - exllamav3
+ - exl3
  - quantized
+ - 4-bit
  - vision
  - multimodal
+ - instruct
+ language:
+ - en
+ - it
+ - multilingual
  library_name: exllamav3
+ pipeline_tag: image-text-to-text
  ---

+ # Qwen3-VL-32B-Instruct-EXL3-4.0bpw

+ ExLlamaV3 quantization of [Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct) - a powerful vision-language model for multimodal tasks.

+ ## Quantization Details

  | Parameter | Value |
  |-----------|-------|
+ | **Bits per Weight** | 4.0 bpw |
+ | **Head Bits** | 6 bpw |
  | **Calibration Rows** | 128 |
+ | **Calibration Context** | 4096 tokens |
+ | **Format** | ExLlamaV3 (EXL3) |
+ | **Size** | ~19 GB |

+ ## Model Capabilities

+ - **Vision Understanding**: Process images at various resolutions
+ - **Video Analysis**: Frame-by-frame understanding
+ - **Context Window**: Up to 128K tokens
+ - **Instruction Following**: Fine-tuned for chat and task completion
+ - **Multilingual**: Strong performance across languages

  ## Hardware Requirements

+ | GPU | VRAM | Notes |
+ |-----|------|-------|
+ | RTX 4090 | 24 GB | Good fit, comfortable with images |
+ | RTX 3090 | 24 GB | Works well |
+ | A100 40GB | 40 GB | Plenty of headroom |
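The sizes in the table above can be sanity-checked with back-of-the-envelope arithmetic (a rough sketch for illustration; the vision-tower and runtime overhead figures are approximations, not measured values):

```python
# Rough size estimate for a 32B-parameter model quantized to 4.0 bpw.
# All figures are approximations for illustration, not measurements.

def quantized_weight_gb(n_params: float, bpw: float) -> float:
    """Size of quantized weights in GB (decimal), at `bpw` bits per weight."""
    return n_params * bpw / 8 / 1e9

weights = quantized_weight_gb(32e9, 4.0)
print(f"4.0 bpw weights: ~{weights:.0f} GB")  # ~16 GB of 4-bit weights

# Head/embedding tensors kept at 6 bpw, plus the vision tower, account for
# the rest of the ~19 GB on-disk size; KV cache and activations push the
# runtime VRAM footprint a few GB above that.
```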

+ ## Use Cases

+ - **Live Assistant**: Real-time screen understanding
+ - **Document Processing**: Extract and analyze document content
+ - **Image Description**: Detailed visual descriptions
+ - **Visual Coding**: Understand code in screenshots
+ - **Chart/Graph Analysis**: Interpret data visualizations

+ ## Usage with TabbyAPI

+ ```yaml
+ # config.yml
+ model:
+   model_dir: models
+   model_name: Qwen3-VL-32B-Instruct-EXL3-4.0bpw
+
+ network:
+   host: 0.0.0.0
+   port: 5000
+
+ model_defaults:
+   max_seq_len: 16384
+   cache_mode: Q4
+ ```
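With the server running, requests go to its OpenAI-compatible chat endpoint. A minimal sketch of building a vision request (the `image_url` message shape follows the OpenAI vision API convention; `build_vision_request` is a hypothetical helper, and the model name matches the config above):

```python
import base64
import json

def build_vision_request(image_bytes: bytes, prompt: str) -> dict:
    """Build an OpenAI-style chat payload with one inline base64 image.

    Hypothetical helper for illustration; the message structure follows
    the OpenAI-compatible vision format.
    """
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": "Qwen3-VL-32B-Instruct-EXL3-4.0bpw",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_vision_request(b"\x89PNG...", "Describe this image.")
print(json.dumps(payload)[:80])
```

POST the resulting JSON to `http://localhost:5000/v1/chat/completions` (host and port per the config above).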

+ ## Recommended Settings

+ - Temperature: 0.7
+ - Top-P: 0.8
+ - Top-K: 20
+ - Repetition Penalty: 1.05
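The settings above map onto request parameters (a sketch; `top_k` and `repetition_penalty` are extensions beyond the core OpenAI schema that ExLlama-based servers such as TabbyAPI commonly accept):

```python
# Recommended sampling settings expressed as chat-completions parameters.
sampler_settings = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,                 # extension parameter, not core OpenAI
    "repetition_penalty": 1.05,  # extension parameter, not core OpenAI
}

# Merge into a request body alongside "model" and "messages".
request_body = {"model": "Qwen3-VL-32B-Instruct-EXL3-4.0bpw", **sampler_settings}
print(request_body["temperature"])
```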
82
 
83
+ ## Comparison with Thinking Variant
84
 
85
+ | Model | Best For |
86
+ |-------|----------|
87
+ | **This (Instruct)** | Fast responses, direct answers, general tasks |
88
+ | **Thinking variant** | Complex reasoning, step-by-step analysis |
 
 
89
 
90
+ ## Original Model
91
 
92
+ This is a quantization of [Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct). All credit for the base model goes to the Qwen team at Alibaba.
93
 
94
+ ## License
95
 
96
+ Apache 2.0 (inherited from base model)