nullrunner committed · Commit 467eeae · verified · 1 Parent(s): 3706e6e

Update README with quantization details

Files changed (1):
  1. README.md +60 -41
README.md CHANGED
@@ -2,76 +2,95 @@
  license: apache-2.0
  base_model: Qwen/Qwen3-VL-32B-Instruct
  tags:
- - exl3
  - exllamav3
  - quantized
  - vision
  - multimodal
- - qwen3
  library_name: exllamav3
  ---

- # Qwen3-VL-32B-Instruct EXL3 4.0bpw

- ExLlamaV3 quantization of [Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct) at 4.0 bits per weight.

- ## Quantization Specifications

  | Parameter | Value |
  |-----------|-------|
- | **Format** | EXL3 (ExLlamaV3) |
- | **Bits per Weight** | 4.0 |
- | **Head Bits** | 6 |
  | **Calibration Rows** | 128 |
- | **Calibration Context** | 4096 |
- | **Codebook** | MCG |
- | **Output Scales** | Auto |
- | **ExLlamaV3 Version** | 0.0.16 |
-
- ## Model Size
-
- | File | Size |
- |------|------|
- | model-00001-of-00003.safetensors | 7.9 GB |
- | model-00002-of-00003.safetensors | 8.0 GB |
- | model-00003-of-00003.safetensors | 1.9 GB |
- | **Total** | **~18 GB** |

- ## Quality Metrics

- - **Final SQNR**: 40.95 dB (excellent)
- - **Cosine Similarity Error**: 0.000053

  ## Hardware Requirements

- - **Minimum VRAM**: ~20 GB (tight fit on RTX 4090 24GB)
- - **Recommended**: RTX 4090, RTX 3090, A100, or better

- ## Usage

- This model requires [ExLlamaV3](https://github.com/turboderp/exllamav3) or compatible inference engines.

- ### With TabbyAPI

- ## Vision Capabilities

- This model supports multimodal input (text + images). Use OpenAI-compatible vision API format:

- ## Original Model
-
- - **Base**: [Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct)
- - **Architecture**: Qwen3VLForConditionalGeneration
- - **Context Length**: Up to 128K tokens
- - **Vocab Size**: 151,669

- ## Quantization Details

- Quantized on NVIDIA A100 80GB using ExLlamaV3 convert.py with standard calibration data (c4, wiki, code).

- ---

- *Quantized by [nullrunner](https://huggingface.co/nullrunner) - November 2025*
 
  license: apache-2.0
  base_model: Qwen/Qwen3-VL-32B-Instruct
  tags:
  - exllamav3
+ - exl3
  - quantized
+ - 4-bit
  - vision
  - multimodal
+ - instruct
+ language:
+ - en
+ - it
+ - multilingual
  library_name: exllamav3
+ pipeline_tag: image-text-to-text
  ---

+ # Qwen3-VL-32B-Instruct-EXL3-4.0bpw

+ ExLlamaV3 quantization of [Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct) - a powerful vision-language model for multimodal tasks.

+ ## Quantization Details

  | Parameter | Value |
  |-----------|-------|
+ | **Bits per Weight** | 4.0 bpw |
+ | **Head Bits** | 6 bpw |
  | **Calibration Rows** | 128 |
+ | **Calibration Context** | 4096 tokens |
+ | **Format** | ExLlamaV3 (EXL3) |
+ | **Size** | ~19 GB |

+ ## Model Capabilities

+ - **Vision Understanding**: Process images at various resolutions
+ - **Video Analysis**: Frame-by-frame understanding
+ - **Context Window**: Up to 128K tokens
+ - **Instruction Following**: Fine-tuned for chat and task completion
+ - **Multilingual**: Strong performance across languages

  ## Hardware Requirements

+ | GPU | VRAM | Notes |
+ |-----|------|-------|
+ | RTX 4090 | 24 GB | Good fit, comfortable with images |
+ | RTX 3090 | 24 GB | Works well |
+ | A100 40GB | 40 GB | Plenty of headroom |
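The sizes in the table above can be sanity-checked with back-of-the-envelope arithmetic (a rough sketch for illustration; the vision-tower and runtime overhead figures are approximations, not measured values):

```python
# Rough size estimate for a 32B-parameter model quantized to 4.0 bpw.
# All figures are approximations for illustration, not measurements.

def quantized_weight_gb(n_params: float, bpw: float) -> float:
    """Size of quantized weights in GB (decimal), at `bpw` bits per weight."""
    return n_params * bpw / 8 / 1e9

weights = quantized_weight_gb(32e9, 4.0)
print(f"4.0 bpw weights: ~{weights:.0f} GB")  # ~16 GB of 4-bit weights

# Head/embedding tensors kept at 6 bpw, plus the vision tower, account for
# the rest of the ~19 GB on-disk size; KV cache and activations push the
# runtime VRAM footprint a few GB above that.
```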

+ ## Use Cases

+ - **Live Assistant**: Real-time screen understanding
+ - **Document Processing**: Extract and analyze document content
+ - **Image Description**: Detailed visual descriptions
+ - **Visual Coding**: Understand code in screenshots
+ - **Chart/Graph Analysis**: Interpret data visualizations

+ ## Usage with TabbyAPI

+ ```yaml
+ # config.yml
+ model:
+   model_dir: models
+   model_name: Qwen3-VL-32B-Instruct-EXL3-4.0bpw
+
+ network:
+   host: 0.0.0.0
+   port: 5000
+
+ model_defaults:
+   max_seq_len: 16384
+   cache_mode: Q4
+ ```
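With the server running, requests go to its OpenAI-compatible chat endpoint. A minimal sketch of building a vision request (the `image_url` message shape follows the OpenAI vision API convention; `build_vision_request` is a hypothetical helper, and the model name matches the config above):

```python
import base64
import json

def build_vision_request(image_bytes: bytes, prompt: str) -> dict:
    """Build an OpenAI-style chat payload with one inline base64 image.

    Hypothetical helper for illustration; the message structure follows
    the OpenAI-compatible vision format.
    """
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": "Qwen3-VL-32B-Instruct-EXL3-4.0bpw",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_vision_request(b"\x89PNG...", "Describe this image.")
print(json.dumps(payload)[:80])
```

POST the resulting JSON to `http://localhost:5000/v1/chat/completions` (host and port per the config above).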

+ ## Recommended Settings

+ - Temperature: 0.7
+ - Top-P: 0.8
+ - Top-K: 20
+ - Repetition Penalty: 1.05
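The settings above map onto request parameters (a sketch; `top_k` and `repetition_penalty` are extensions beyond the core OpenAI schema that ExLlama-based servers such as TabbyAPI commonly accept):

```python
# Recommended sampling settings expressed as chat-completions parameters.
sampler_settings = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,                 # extension parameter, not core OpenAI
    "repetition_penalty": 1.05,  # extension parameter, not core OpenAI
}

# Merge into a request body alongside "model" and "messages".
request_body = {"model": "Qwen3-VL-32B-Instruct-EXL3-4.0bpw", **sampler_settings}
print(request_body["temperature"])
```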
82
 
83
+ ## Comparison with Thinking Variant
84
 
85
+ | Model | Best For |
86
+ |-------|----------|
87
+ | **This (Instruct)** | Fast responses, direct answers, general tasks |
88
+ | **Thinking variant** | Complex reasoning, step-by-step analysis |
 
 
89
 
90
+ ## Original Model
91
 
92
+ This is a quantization of [Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct). All credit for the base model goes to the Qwen team at Alibaba.
93
 
94
+ ## License
95
 
96
+ Apache 2.0 (inherited from base model)