blockmandev Claude Opus 4.6 committed on
Commit
ca2f93d
·
0 Parent(s):

QORA-4B: Pure Rust multimodal inference engine


Based on Qwen3.5-4B. Q4 quantized text + F16 vision weights.
Text, image, and video understanding with thinking mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (5)
  1. .gitattributes +3 -0
  2. README.md +199 -0
  3. model.qor4b +3 -0
  4. qor4b.exe +3 -0
  5. tokenizer.json +3 -0
.gitattributes ADDED
@@ -0,0 +1,3 @@
*.qor4b filter=lfs diff=lfs merge=lfs -text
*.exe filter=lfs diff=lfs merge=lfs -text
*.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,199 @@
---
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
language:
- en
- zh
- multilingual
library_name: rust
tags:
- text-generation
- image-text-to-text
- video-text-to-text
- multimodal
- vision
- rust
- pure-rust
- no-python
- quantized
- deltanet
- hybrid-attention
pipeline_tag: image-text-to-text
model-index:
- name: QORA-4B
  results: []
---

# QORA-4B

Pure Rust multimodal inference engine based on [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B). No Python, no CUDA, no external ML frameworks. Single executable + model weights = portable AI that runs on any machine.

## License

This project is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0). The base model, [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B), is released by the Qwen team under Apache 2.0.

## What It Does

QORA-4B is a 4-billion-parameter language model with built-in vision. It supports:

- **Text generation** — answer questions, write code, reason through problems
- **Image understanding** — describe photos, answer questions about images
- **Video understanding** — analyze frame sequences, describe motion and temporal changes
- **Thinking mode** — extended chain-of-thought reasoning with a configurable token budget


## Architecture

QORA-4B uses a hybrid architecture combining two attention mechanisms:

| Component | Details |
|-----------|---------|
| **Parameters** | 4B total |
| **Hidden dim** | 2560 |
| **Layers** | 32 (24 DeltaNet + 8 Full Attention) |
| **Layer pattern** | 3x DeltaNet + 1x Full Attention, repeated 8 times |
| **Vocabulary** | 248,320 tokens |
| **Context** | 262K tokens natively |

### DeltaNet Layers (24 of 32)
- Gated linear attention with delta-rule state updates
- 16 QK heads + 32 V heads, head_dim=128
- Causal Conv1d (kernel=4) + SiLU activation
- O(1) memory per token (recurrent state, no KV cache needed)
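
The delta-rule update that keeps these layers at O(1) memory per token can be sketched as follows. This is an illustrative single-head step over plain slices, not the engine's actual kernel; the row-major state layout and the per-token gate `beta` are assumptions:

```rust
// Illustrative delta-rule state update for one DeltaNet head.
// State `s` is a d_v x d_k matrix stored row-major; `k` and `q` have
// length d_k, `v` has length d_v. The state size is fixed regardless of
// sequence length, which is why no KV cache is needed.
fn delta_step(s: &mut [f32], k: &[f32], v: &[f32], q: &[f32], beta: f32) -> Vec<f32> {
    let (dv, dk) = (v.len(), k.len());
    // prediction = S k  (what the state currently stores for this key)
    let pred: Vec<f32> = (0..dv)
        .map(|i| (0..dk).map(|j| s[i * dk + j] * k[j]).sum())
        .collect();
    // S <- S + beta * (v - S k) k^T  (delta rule: correct the stored value)
    for i in 0..dv {
        let err = beta * (v[i] - pred[i]);
        for j in 0..dk {
            s[i * dk + j] += err * k[j];
        }
    }
    // output = S q
    (0..dv)
        .map(|i| (0..dk).map(|j| s[i * dk + j] * q[j]).sum())
        .collect()
}
```

With `beta = 1`, writing value `v` under key `k` and then querying with `q = k` reads back `v` exactly.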

### Full Attention Layers (8 of 32)
- Grouped Query Attention (16 Q / 4 KV heads), head_dim=256
- QK-norm + partial RoPE (64 of 256 dims rotated), theta=10M
- Output gating (sigmoid gate on the attention output)
- Standard KV cache
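
Partial RoPE means only the first 64 dims of each 256-dim head are rotated and the rest pass through unchanged. A minimal sketch — the half-split pairing `(i, i + rot/2)` is an assumption; real implementations differ in pairing convention:

```rust
// Illustrative partial RoPE for one head: only the first `rot` dims are
// rotated; dims beyond `rot` are left untouched.
fn partial_rope(x: &mut [f32], pos: usize, rot: usize, theta: f32) {
    let half = rot / 2;
    for i in 0..half {
        // frequency for pair i: theta^(-2i/rot)
        let freq = theta.powf(-2.0 * i as f32 / rot as f32);
        let angle = pos as f32 * freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (x[i], x[i + half]);
        x[i] = a * cos - b * sin;
        x[i + half] = a * sin + b * cos;
    }
}
```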

### Vision Encoder
- 24-layer ViT, hidden=1024, 16 heads
- Conv3d patch embedding [1024, 3, 2, 16, 16] (temporal_patch_size=2)
- Learned positional embedding with bilinear interpolation from a 48x48 grid
- 2D spatial RoPE (dim=32, theta=10000)
- 2x2 spatial merger: LayerNorm → concat → MLP(4096 → 2560)
- **Images**: single frame duplicated along the temporal axis
- **Video**: actual Conv3d over consecutive frame pairs (N frames → N/2 temporal patches)

## Weight Formats

| Format | Size | Quality | Speed |
|--------|------|---------|-------|
| **Q4** (default) | ~2.9 GB | Good | ~0.9 tok/s |
| **F16** | ~7.5 GB | Best | ~0.5 tok/s |

Q4 uses 4-bit symmetric quantization with group_size=32 and LUT-optimized dequantization. GEMV/GEMM for large matrices is multi-threaded via rayon.
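
The Q4 scheme can be approximated by the following sketch. The per-group scale rule and clamp range are assumptions; the real kernel additionally packs two 4-bit values per byte and dequantizes through a lookup table, which is omitted here:

```rust
// Illustrative 4-bit symmetric quantization with group_size = 32.
// Each group of 32 weights shares one f32 scale; values are mapped
// to integers in [-7, 7] (symmetric range, an assumption for this sketch).
const GROUP: usize = 32;

fn quantize_q4(x: &[f32]) -> Vec<(f32, Vec<i8>)> {
    x.chunks(GROUP)
        .map(|g| {
            let amax = g.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = if amax > 0.0 { amax / 7.0 } else { 1.0 };
            let q = g.iter()
                .map(|v| (v / scale).round().clamp(-7.0, 7.0) as i8)
                .collect();
            (scale, q)
        })
        .collect()
}

fn dequantize_q4(groups: &[(f32, Vec<i8>)]) -> Vec<f32> {
    groups.iter()
        .flat_map(|(scale, q)| q.iter().map(move |&v| v as f32 * scale))
        .collect()
}
```

The round trip is lossy, but each element's error is bounded by half a quantization step (scale / 2).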

## Quick Start

1. Download `qor4b.exe`, `model.qor4b`, and `tokenizer.json` into the same folder
2. Run:

```bash
# Text generation
qor4b --prompt "Explain quantum computing" --max-tokens 500

# Image understanding
qor4b --prompt "What's in this image?" --image photo.jpg

# Video understanding (directory of frame images)
qor4b --prompt "What happens in this video?" --video frames_dir/

# Thinking mode (default, extended reasoning)
qor4b --prompt "Solve: integral of x^2 * e^x dx" --think-budget 2048

# No-think mode (faster, direct answers)
qor4b --prompt "What is 2+2?" --no-think

# Greedy decoding (deterministic output)
qor4b --prompt "Hello" --greedy
```

### CLI Flags

| Flag | Description |
|------|-------------|
| `--prompt TEXT` | Input prompt (default: "Hello, how are you?") |
| `--image PATH` | Path to an image file (PNG/JPG) |
| `--video PATH` | Path to a directory of frame images (PNG/JPG, sorted by name) |
| `--max-tokens N` | Max tokens to generate (default: 1024) |
| `--think-budget N` | Max thinking tokens before forcing an answer (default: 1024) |
| `--no-think` | Disable thinking mode (direct answers) |
| `--show-think` | Display thinking tokens on stderr |
| `--greedy` | Greedy decoding (temperature=0; not recommended with thinking mode) |

### Sampling Defaults

| Parameter | Think mode | No-think mode |
|-----------|-----------|---------------|
| temperature | 1.0 | 0.7 |
| top_k | 20 | 20 |
| top_p | 0.95 | 0.95 |
| presence_penalty | 1.5 | 1.5 |
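
With these defaults, each token is drawn from the top_k candidates whose cumulative probability stays within top_p. A minimal filtering sketch — the stop-at-or-after-p cutoff is an assumption; real samplers differ in tie-breaking and cutoff details:

```rust
// Illustrative top-k + top-p (nucleus) filtering over a probability
// distribution. Returns the surviving indices, highest probability first.
fn top_k_top_p(probs: &[f32], k: usize, p: f32) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    // sort candidate indices by descending probability, keep the top k
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    idx.truncate(k);
    let mut kept = Vec::new();
    let mut cum = 0.0;
    for i in idx {
        kept.push(i);
        cum += probs[i];
        if cum >= p { break; }  // nucleus: stop once mass p is covered
    }
    kept
}
```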

### Video Input

Video is provided as a directory of frame images (not a video file). Extract frames however you like:

```bash
# Example: extract 4 frames from a video with ffmpeg
ffmpeg -i video.mp4 -vf "select=not(mod(n\,30))" -frames:v 4 frames/frame_%02d.png

# Then run
qor4b --prompt "Describe what happens" --video frames/
```

Frames are loaded in alphabetical order, resized to uniform dimensions (max 768px, divisible by 32), and processed as temporal pairs via Conv3d. Odd frame counts are padded by duplicating the last frame.
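
The pairing and padding rule amounts to something like this (an illustrative sketch; the generic `T` stands in for a decoded frame):

```rust
// Illustrative frame-pair grouping: Conv3d consumes frames two at a time,
// so an odd count is padded by duplicating the last frame.
fn pair_frames<T: Clone>(mut frames: Vec<T>) -> Vec<(T, T)> {
    if frames.len() % 2 == 1 {
        let last = frames.last().cloned().expect("at least one frame");
        frames.push(last);
    }
    frames.chunks(2).map(|p| (p[0].clone(), p[1].clone())).collect()
}
```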

**Performance guide:**
- 4 frames @ 256x256: ~180s vision encode, 128 merged tokens
- 8 frames @ 256x256: ~10 min vision encode, 256 merged tokens
- Keep frames small and few for interactive use

## Built With

- **Language**: Pure Rust (2024 edition)
- **Dependencies**: `half` (f16), `rayon` (parallelism), `image` (image loading), `tokenizers` (HuggingFace tokenizer), `memmap2` (mmap for the converter), `serde_json` (config parsing)
- **No ML framework** for inference — all matrix ops are hand-written Rust
- **Burn framework** used only as a build dependency (for binary format types)

## File Structure

```
src/
  main.rs      — CLI entry point, argument parsing
  config.rs    — Model architecture configuration
  gemv.rs      — GEMV/GEMM kernels (F16 + Q4), hybrid forward pass, prefill
  generate.rs  — Text generation loop (text, image, video modes)
  tokenizer.rs — Tokenizer wrapper and chat templates
  vision.rs    — Vision encoder (ViT + merger), image/video loading
  save.rs      — Binary model format (.qor4b) save/load
  convert.rs   — One-time safetensors → .qor4b converter
  lib.rs       — Module exports
```

## Model Binary Format (.qor4b)

Custom binary format for fast loading:

```
Header: "QOR4" magic + version (u32) + format (u8: 0=F16, 1=Q4)
Config: architecture params (vocab, hidden, layers, heads, etc.)
Layers: 32 layers, each with a type byte + layer-specific weights
Global: embedding + final norm + precomputed RoPE tables
Vision: Conv3d patch embed + pos_embed + 24 ViT blocks + merger MLP
```

Loading is ~30s for the Q4 model (~2.9 GB) via buffered sequential reads.
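
Reading that header back can be sketched as follows; the field order matches the layout above, while little-endian encoding is an assumption of this sketch:

```rust
use std::io::{self, Read};

// Illustrative reader for the .qor4b header: "QOR4" magic, then a u32
// version, then a u8 format flag (0 = F16, 1 = Q4).
#[derive(Debug, PartialEq)]
struct Header {
    version: u32,
    format: u8,
}

fn read_header<R: Read>(r: &mut R) -> io::Result<Header> {
    let mut magic = [0u8; 4];
    r.read_exact(&mut magic)?;
    if &magic != b"QOR4" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
    }
    let mut ver = [0u8; 4];
    r.read_exact(&mut ver)?;
    let mut fmt = [0u8; 1];
    r.read_exact(&mut fmt)?;
    Ok(Header { version: u32::from_le_bytes(ver), format: fmt[0] })
}
```

Any `Read` source works, so the same code covers files and in-memory buffers.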

## Performance

Tested on an i5-11500 (6C/12T), 16 GB RAM, CPU-only:

| Task | Speed |
|------|-------|
| Text decode | ~0.9 tok/s (Q4) |
| Text prefill | ~1.0 tok/s |
| Image encode (256x256) | ~90s |
| Video encode (4 frames, 256x256) | ~180s |
| Model load (Q4) | ~37s |
model.qor4b ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f130350857c0992291b689af2ef24e4d46ec79e96177444d97184dbcde16a09c
size 3037831016
qor4b.exe ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7bdafccab358669651fc0c7a9fae14f311c7f2ce566d71c36bc162509b40316a
size 6192128
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5f9e4d4901a92b997e463c1f46055088b6cca5ca61a6522d1b9f64c4bb81cb42
size 12807982