blockmandev Claude Opus 4.6 committed on
Commit
ca2f93d
·
0 Parent(s):

QORA-4B: Pure Rust multimodal inference engine


Based on Qwen3.5-4B. Q4 quantized text + F16 vision weights.
Text, image, and video understanding with thinking mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (5)
  1. .gitattributes +3 -0
  2. README.md +199 -0
  3. model.qor4b +3 -0
  4. qor4b.exe +3 -0
  5. tokenizer.json +3 -0
.gitattributes ADDED
@@ -0,0 +1,3 @@
*.qor4b filter=lfs diff=lfs merge=lfs -text
*.exe filter=lfs diff=lfs merge=lfs -text
*.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,199 @@
---
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
language:
- en
- zh
- multilingual
library_name: rust
tags:
- text-generation
- image-text-to-text
- video-text-to-text
- multimodal
- vision
- rust
- pure-rust
- no-python
- quantized
- deltanet
- hybrid-attention
pipeline_tag: image-text-to-text
model-index:
- name: QORA-4B
  results: []
---

# QORA-4B

Pure Rust multimodal inference engine based on [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B). No Python, no CUDA, no external ML frameworks. Single executable + model weights = portable AI that runs on any machine.

## License

This project is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0). The base model, [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B), is released by the Qwen team under Apache 2.0.

## What It Does

QORA-4B is a 4-billion-parameter language model with built-in vision. It supports:

- **Text generation** — answer questions, write code, reason through problems
- **Image understanding** — describe photos, answer questions about images
- **Video understanding** — analyze frame sequences, describe motion and temporal changes
- **Thinking mode** — extended chain-of-thought reasoning with a configurable token budget


## Architecture

QORA-4B uses a hybrid architecture combining two attention mechanisms:

| Component | Details |
|-----------|---------|
| **Parameters** | 4B total |
| **Hidden dim** | 2560 |
| **Layers** | 32 (24 DeltaNet + 8 Full Attention) |
| **Layer pattern** | 3x DeltaNet + 1x Full Attention, repeated 8 times |
| **Vocabulary** | 248,320 tokens |
| **Context** | 262K tokens natively |

### DeltaNet Layers (24 of 32)
- Gated linear attention with delta-rule state updates
- 16 QK heads + 32 V heads, head_dim=128
- Causal Conv1d (kernel=4) + SiLU activation
- O(1) memory per token (recurrent state, no KV cache needed)
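
The delta-rule update that keeps these layers at O(1) memory per token can be sketched as follows. This is an illustrative single-head step over plain slices, not the engine's actual kernel; the row-major state layout and the per-token gate `beta` are assumptions:

```rust
// Illustrative delta-rule state update for one DeltaNet head.
// State `s` is a d_v x d_k matrix stored row-major; `k` and `q` have
// length d_k, `v` has length d_v. The state size is fixed regardless of
// sequence length, which is why no KV cache is needed.
fn delta_step(s: &mut [f32], k: &[f32], v: &[f32], q: &[f32], beta: f32) -> Vec<f32> {
    let (dv, dk) = (v.len(), k.len());
    // prediction = S k  (what the state currently stores for this key)
    let pred: Vec<f32> = (0..dv)
        .map(|i| (0..dk).map(|j| s[i * dk + j] * k[j]).sum())
        .collect();
    // S <- S + beta * (v - S k) k^T  (delta rule: correct the stored value)
    for i in 0..dv {
        let err = beta * (v[i] - pred[i]);
        for j in 0..dk {
            s[i * dk + j] += err * k[j];
        }
    }
    // output = S q
    (0..dv)
        .map(|i| (0..dk).map(|j| s[i * dk + j] * q[j]).sum())
        .collect()
}
```

With `beta = 1`, writing value `v` under key `k` and then querying with `q = k` reads back `v` exactly.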

### Full Attention Layers (8 of 32)
- Grouped Query Attention (16 Q / 4 KV heads), head_dim=256
- QK-norm + partial RoPE (64 of 256 dims rotated), theta=10M
- Output gating (sigmoid gate on the attention output)
- Standard KV cache
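
Partial RoPE means only the first 64 dims of each 256-dim head are rotated and the rest pass through unchanged. A minimal sketch — the half-split pairing `(i, i + rot/2)` is an assumption; real implementations differ in pairing convention:

```rust
// Illustrative partial RoPE for one head: only the first `rot` dims are
// rotated; dims beyond `rot` are left untouched.
fn partial_rope(x: &mut [f32], pos: usize, rot: usize, theta: f32) {
    let half = rot / 2;
    for i in 0..half {
        // frequency for pair i: theta^(-2i/rot)
        let freq = theta.powf(-2.0 * i as f32 / rot as f32);
        let angle = pos as f32 * freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (x[i], x[i + half]);
        x[i] = a * cos - b * sin;
        x[i + half] = a * sin + b * cos;
    }
}
```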

### Vision Encoder
- 24-layer ViT, hidden=1024, 16 heads
- Conv3d patch embedding [1024, 3, 2, 16, 16] (temporal_patch_size=2)
- Learned positional embedding with bilinear interpolation from a 48x48 grid
- 2D spatial RoPE (dim=32, theta=10000)
- 2x2 spatial merger: LayerNorm → concat → MLP(4096 → 2560)
- **Images**: single frame duplicated along the temporal axis
- **Video**: actual Conv3d over consecutive frame pairs (N frames → N/2 temporal patches)

## Weight Formats

| Format | Size | Quality | Speed |
|--------|------|---------|-------|
| **Q4** (default) | ~2.9 GB | Good | ~0.9 tok/s |
| **F16** | ~7.5 GB | Best | ~0.5 tok/s |

Q4 uses 4-bit symmetric quantization with group_size=32 and LUT-optimized dequantization. GEMV/GEMM for large matrices is multi-threaded via rayon.
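
The Q4 scheme can be approximated by the following sketch. The per-group scale rule and clamp range are assumptions; the real kernel additionally packs two 4-bit values per byte and dequantizes through a lookup table, which is omitted here:

```rust
// Illustrative 4-bit symmetric quantization with group_size = 32.
// Each group of 32 weights shares one f32 scale; values are mapped
// to integers in [-7, 7] (symmetric range, an assumption for this sketch).
const GROUP: usize = 32;

fn quantize_q4(x: &[f32]) -> Vec<(f32, Vec<i8>)> {
    x.chunks(GROUP)
        .map(|g| {
            let amax = g.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = if amax > 0.0 { amax / 7.0 } else { 1.0 };
            let q = g.iter()
                .map(|v| (v / scale).round().clamp(-7.0, 7.0) as i8)
                .collect();
            (scale, q)
        })
        .collect()
}

fn dequantize_q4(groups: &[(f32, Vec<i8>)]) -> Vec<f32> {
    groups.iter()
        .flat_map(|(scale, q)| q.iter().map(move |&v| v as f32 * scale))
        .collect()
}
```

The round trip is lossy, but each element's error is bounded by half a quantization step (scale / 2).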

## Quick Start

1. Download `qor4b.exe`, `model.qor4b`, and `tokenizer.json` into the same folder
2. Run:

```bash
# Text generation
qor4b --prompt "Explain quantum computing" --max-tokens 500

# Image understanding
qor4b --prompt "What's in this image?" --image photo.jpg

# Video understanding (directory of frame images)
qor4b --prompt "What happens in this video?" --video frames_dir/

# Thinking mode (default, extended reasoning)
qor4b --prompt "Solve: integral of x^2 * e^x dx" --think-budget 2048

# No-think mode (faster, direct answers)
qor4b --prompt "What is 2+2?" --no-think

# Greedy decoding (deterministic output)
qor4b --prompt "Hello" --greedy
```

### CLI Flags

| Flag | Description |
|------|-------------|
| `--prompt TEXT` | Input prompt (default: "Hello, how are you?") |
| `--image PATH` | Path to an image file (PNG/JPG) |
| `--video PATH` | Path to a directory of frame images (PNG/JPG, sorted by name) |
| `--max-tokens N` | Max tokens to generate (default: 1024) |
| `--think-budget N` | Max thinking tokens before forcing an answer (default: 1024) |
| `--no-think` | Disable thinking mode (direct answers) |
| `--show-think` | Display thinking tokens on stderr |
| `--greedy` | Greedy decoding (temperature=0; not recommended with thinking mode) |

### Sampling Defaults

| Parameter | Think mode | No-think mode |
|-----------|-----------|---------------|
| temperature | 1.0 | 0.7 |
| top_k | 20 | 20 |
| top_p | 0.95 | 0.95 |
| presence_penalty | 1.5 | 1.5 |
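
With these defaults, each token is drawn from the top_k candidates whose cumulative probability stays within top_p. A minimal filtering sketch — the stop-at-or-after-p cutoff is an assumption; real samplers differ in tie-breaking and cutoff details:

```rust
// Illustrative top-k + top-p (nucleus) filtering over a probability
// distribution. Returns the surviving indices, highest probability first.
fn top_k_top_p(probs: &[f32], k: usize, p: f32) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    // sort candidate indices by descending probability, keep the top k
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    idx.truncate(k);
    let mut kept = Vec::new();
    let mut cum = 0.0;
    for i in idx {
        kept.push(i);
        cum += probs[i];
        if cum >= p { break; }  // nucleus: stop once mass p is covered
    }
    kept
}
```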

### Video Input

Video is provided as a directory of frame images (not a video file). Extract frames however you like:

```bash
# Example: extract 4 frames from a video with ffmpeg
ffmpeg -i video.mp4 -vf "select=not(mod(n\,30))" -frames:v 4 frames/frame_%02d.png

# Then run
qor4b --prompt "Describe what happens" --video frames/
```

Frames are loaded in alphabetical order, resized to uniform dimensions (max 768px, divisible by 32), and processed as temporal pairs via Conv3d. Odd frame counts are padded by duplicating the last frame.
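
The pairing and padding rule amounts to something like this (an illustrative sketch; the generic `T` stands in for a decoded frame):

```rust
// Illustrative frame-pair grouping: Conv3d consumes frames two at a time,
// so an odd count is padded by duplicating the last frame.
fn pair_frames<T: Clone>(mut frames: Vec<T>) -> Vec<(T, T)> {
    if frames.len() % 2 == 1 {
        let last = frames.last().cloned().expect("at least one frame");
        frames.push(last);
    }
    frames.chunks(2).map(|p| (p[0].clone(), p[1].clone())).collect()
}
```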

**Performance guide:**
- 4 frames @ 256x256: ~180s vision encode, 128 merged tokens
- 8 frames @ 256x256: ~10 min vision encode, 256 merged tokens
- Keep frames small and few for interactive use

## Built With

- **Language**: Pure Rust (2024 edition)
- **Dependencies**: `half` (f16), `rayon` (parallelism), `image` (image loading), `tokenizers` (HuggingFace tokenizer), `memmap2` (mmap for the converter), `serde_json` (config parsing)
- **No ML framework** for inference — all matrix ops are hand-written Rust
- **Burn framework** used only as a build dependency (for binary format types)

## File Structure

```
src/
  main.rs      — CLI entry point, argument parsing
  config.rs    — Model architecture configuration
  gemv.rs      — GEMV/GEMM kernels (F16 + Q4), hybrid forward pass, prefill
  generate.rs  — Text generation loop (text, image, video modes)
  tokenizer.rs — Tokenizer wrapper and chat templates
  vision.rs    — Vision encoder (ViT + merger), image/video loading
  save.rs      — Binary model format (.qor4b) save/load
  convert.rs   — One-time safetensors → .qor4b converter
  lib.rs       — Module exports
```

## Model Binary Format (.qor4b)

Custom binary format for fast loading:

```
Header: "QOR4" magic + version (u32) + format (u8: 0=F16, 1=Q4)
Config: architecture params (vocab, hidden, layers, heads, etc.)
Layers: 32 layers, each with a type byte + layer-specific weights
Global: embedding + final norm + precomputed RoPE tables
Vision: Conv3d patch embed + pos_embed + 24 ViT blocks + merger MLP
```

Loading is ~30s for the Q4 model (~2.9 GB) via buffered sequential reads.
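
Reading that header back can be sketched as follows; the field order matches the layout above, while little-endian encoding is an assumption of this sketch:

```rust
use std::io::{self, Read};

// Illustrative reader for the .qor4b header: "QOR4" magic, then a u32
// version, then a u8 format flag (0 = F16, 1 = Q4).
#[derive(Debug, PartialEq)]
struct Header {
    version: u32,
    format: u8,
}

fn read_header<R: Read>(r: &mut R) -> io::Result<Header> {
    let mut magic = [0u8; 4];
    r.read_exact(&mut magic)?;
    if &magic != b"QOR4" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
    }
    let mut ver = [0u8; 4];
    r.read_exact(&mut ver)?;
    let mut fmt = [0u8; 1];
    r.read_exact(&mut fmt)?;
    Ok(Header { version: u32::from_le_bytes(ver), format: fmt[0] })
}
```

Any `Read` source works, so the same code covers files and in-memory buffers.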

## Performance

Tested on an i5-11500 (6C/12T), 16 GB RAM, CPU-only:

| Task | Speed |
|------|-------|
| Text decode | ~0.9 tok/s (Q4) |
| Text prefill | ~1.0 tok/s |
| Image encode (256x256) | ~90s |
| Video encode (4 frames, 256x256) | ~180s |
| Model load (Q4) | ~37s |
model.qor4b ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f130350857c0992291b689af2ef24e4d46ec79e96177444d97184dbcde16a09c
size 3037831016
qor4b.exe ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7bdafccab358669651fc0c7a9fae14f311c7f2ce566d71c36bc162509b40316a
size 6192128
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5f9e4d4901a92b997e463c1f46055088b6cca5ca61a6522d1b9f64c4bb81cb42
size 12807982