---
language:
- en
- multilingual
license: apache-2.0
tags:
- rust
- cpu-inference
- quantized
- q4
- image-classification
- zero-shot-classification
- image-embedding
- siglip
- vision-transformer
- pure-rust
- no-python
- no-cuda
- contrastive-learning
base_model: google/siglip2-base-patch16-224
library_name: qora
pipeline_tag: zero-shot-image-classification
model-index:
- name: QORA-Vision-Image
results:
- task:
type: zero-shot-image-classification
dataset:
name: ImageNet-1K
type: imagenet-1k
metrics:
- name: Zero-shot Accuracy
type: accuracy
value: 69.8
---
# QORA-Vision (Image) - Native Rust Image Encoder
A pure-Rust image understanding engine based on SigLIP 2, providing zero-shot image classification, image embeddings, and image-text similarity scoring. No Python runtime, no CUDA, no external dependencies.
## Overview
| Property | Value |
|----------|-------|
| **Engine** | QORA-Vision (Pure Rust) |
| **Base Model** | SigLIP 2 Base (google/siglip2-base-patch16-224) |
| **Vision Params** | ~93M |
| **Text Params** | ~283M (256K vocab) |
| **Quantization** | Q4 (4-bit symmetric, group_size=32) |
| **Model Size** | 210 MB (Q4 binary, vision + text) |
| **Executable** | 4.4 MB |
| **Input** | 224x224 RGB images (PNG/JPEG) |
| **Output** | 768-dim embeddings + zero-shot classification scores |
| **Platform** | Windows x86_64 (CPU-only) |
## Architecture
### Vision Encoder (12-layer ViT-Base)
| Component | Details |
|-----------|---------|
| **Layers** | 12 transformer layers |
| **Hidden Size** | 768 |
| **Attention Heads** | 12 (head_dim=64) |
| **MLP (Intermediate)** | 3,072 (GELU-Tanh activation) |
| **Patch Size** | 16x16 (non-overlapping) |
| **Sequence Length** | 196 patches (14x14 grid) |
| **Normalization** | LayerNorm with bias (eps=1e-6) |
| **Attention** | Bidirectional (no causal mask) |
| **Position Encoding** | Learned position embeddings |
| **Pooling** | MAP (Multi-head Attention Pooling) |
### Text Encoder (12-layer Transformer)
| Component | Details |
|-----------|---------|
| **Layers** | 12 transformer layers |
| **Hidden Size** | 768 |
| **Vocabulary** | 256,000 tokens |
| **Max Position** | 64 tokens |
| **Pooling** | Last token + linear head |
### Contrastive Scoring
```
score = sigmoid(cosine_sim(image_embed, text_embed) * exp(logit_scale) + logit_bias)
```
## Pipeline
```
Image (224x224) → Patch Embedding (196 patches)
  → Add Position Embeddings
  → 12x ViT Transformer Layers (bidirectional)
  → Post-LayerNorm
  → MAP Pooling (cross-attention with learned probe)
  → L2 Normalize
  → 768-dim Image Embedding

Text → Tokenize → Token + Position Embedding
  → 12x Transformer Layers
  → Final LayerNorm (last token)
  → Linear Head
  → L2 Normalize
  → 768-dim Text Embedding

Score = sigmoid(cosine_sim * exp(scale) + bias)
```
## Files
```
siglip-model/
qora-vision.exe - 4.4 MB Inference engine
model.qora-vision - 210 MB Full model (vision + text, Q4)
tokenizer.json - 33 MB Text tokenizer (256K vocab)
config.json - 611 B QORA-branded config
README.md - This file
```
## Usage
```bash
# Zero-shot classification (fast, from binary)
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --labels "cat,dog,bird,car"
# Image-text similarity
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --text "a photo of a sunset"
# Image embedding only
qora-vision.exe siglip --load model.qora-vision --image photo.jpg
# Load from safetensors (slow, first time)
qora-vision.exe siglip --model-path ../SigLIP2/ --image photo.jpg --labels "cat,dog,bird,car"
# Save binary for fast loading
qora-vision.exe siglip --model-path ../SigLIP2/ --save model.qora-vision
```
### CLI Arguments
| Flag | Default | Description |
|------|---------|-------------|
| `--model-path <path>` | `.` | Path to model directory (safetensors) |
| `--image <path>` | - | Input image (PNG/JPEG) |
| `--labels <list>` | - | Comma-separated labels for zero-shot |
| `--text <string>` | - | Text for similarity scoring |
| `--load <path>` | - | Load binary (.qora-vision, includes vision + text) |
| `--save <path>` | - | Save full model binary (vision + text + scale/bias) |
| `--f16` | off | Use F16 weights instead of Q4 |
## Published Benchmarks
### SigLIP 2 Base (224px) - Published Scores
| Benchmark | Score |
|-----------|-------|
| **ImageNet-1K Zero-shot** | ~69.8% |
| **Multilingual support** | Yes (trained on WebLI) |
SigLIP 2 improves over the original SigLIP with enhanced semantic understanding, localization, and dense features. The sigmoid loss yields better-calibrated scores than CLIP's softmax-based approach, since each image-text pair is scored independently.
### Model Comparison
| Model | Params | Image Size | Architecture | Zero-shot ImageNet |
|-------|--------|------------|-------------|-------------------|
| **QORA-Vision (SigLIP 2 Base)** | 93M | 224 | ViT-B/16 | ~69.8% |
| CLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 68.3% |
| SigLIP Base (v1) | 86M | 224 | ViT-B/16 | 66.2% |
| OpenCLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 67.0% |
## Test Results
All tests run with Q4 quantization on CPU.
### Test 1: Red Image Classification
**Input:** Solid red 224x224 image
**Labels:** red, blue, green, yellow
| Label | Score |
|-------|-------|
| **red** | **0.0022** |
| blue | 0.0000 |
| green | 0.0000 |
| yellow | 0.0000 |
| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified "red") |
| Vision Forward | 42.0s |
| Embedding Dim | 768, L2 norm = 1.0000 |
### Test 2: Blue Image Classification
**Input:** Solid blue 224x224 image
**Labels:** red, blue, green, yellow
| Label | Score |
|-------|-------|
| red | 0.0000 |
| **blue** | **0.0014** |
| green | 0.0000 |
| yellow | 0.0000 |
| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified "blue") |
| Vision Forward | 31.5s |
### Test 3: Green Image with Natural Language Labels
**Input:** Solid green 224x224 image
**Labels:** "a photo of a cat", "a photo of a dog", "a solid green image", "a landscape"
| Label | Score |
|-------|-------|
| a photo of a cat | 0.0000 |
| a photo of a dog | 0.0000 |
| **a solid green image** | **0.0176** |
| a landscape | 0.0000 |
| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified natural language description) |
| Vision Forward | 39.2s |
| Note | Highest score by far, demonstrating text understanding |
### Test Summary
| Test | Input | Best Label | Correct? | Score |
|------|-------|------------|----------|-------|
| Color (red) | Solid red | "red" | PASS | 0.0022 |
| Color (blue) | Solid blue | "blue" | PASS | 0.0014 |
| NL Description | Solid green | "a solid green image" | PASS | 0.0176 |
| **Overall** | | | **3/3 (100%)** | |
## Performance
| Metric | Value |
|--------|-------|
| **Binary Load** | ~115ms (full model, 210 MB) |
| **Safetensors Load** | ~11-20s |
| **Vision Forward** | ~13-20s (196 tokens, 12 layers) |
| **Text Forward** | ~5s per label |
| **Total (4 labels)** | ~33-55s |
| **Memory (Vision Q4)** | 58 MB |
| **Memory (Text Q4)** | 151 MB |
| **Binary Save** | ~2s (210 MB) |
## QORA Model Family
| Engine | Model | Params | Size (Q4) | Purpose |
|--------|-------|--------|-----------|---------|
| **QORA** | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat |
| **QORA-TTS** | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis |
| **QORA-Vision (Image)** | SigLIP 2 Base | 93M | 58 MB | Image embeddings, zero-shot classification |
| **QORA-Vision (Video)** | ViViT Base | 89M | 60 MB | Video action classification |
---
*Built with QORA - Pure Rust AI Inference*