---
language:
  - en
  - multilingual
license: apache-2.0
tags:
  - rust
  - cpu-inference
  - quantized
  - q4
  - image-classification
  - zero-shot-classification
  - image-embedding
  - siglip
  - vision-transformer
  - pure-rust
  - no-python
  - no-cuda
  - contrastive-learning
base_model: google/siglip2-base-patch16-224
library_name: qora
pipeline_tag: zero-shot-image-classification
model-index:
  - name: QORA-Vision-Image
    results:
      - task:
          type: zero-shot-image-classification
        dataset:
          name: ImageNet-1K
          type: imagenet-1k
        metrics:
          - name: Zero-shot Accuracy
            type: accuracy
            value: 69.8
---

# QORA-Vision (Image) - Native Rust Image Encoder

A pure-Rust image understanding engine based on SigLIP 2, supporting zero-shot image classification, image embeddings, and image-text similarity. No Python runtime, no CUDA, no external dependencies.

## Overview

| Property | Value |
|----------|-------|
| **Engine** | QORA-Vision (Pure Rust) |
| **Base Model** | SigLIP 2 Base (google/siglip2-base-patch16-224) |
| **Vision Params** | ~93M |
| **Text Params** | ~283M (256K vocab) |
| **Quantization** | Q4 (4-bit symmetric, group_size=32) |
| **Model Size** | 210 MB (Q4 binary, vision + text) |
| **Executable** | 4.4 MB |
| **Input** | 224x224 RGB images (PNG/JPEG) |
| **Output** | 768-dim embeddings + zero-shot classification scores |
| **Platform** | Windows x86_64 (CPU-only) |
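The Q4 scheme in the table (4-bit symmetric, group_size=32) can be sketched as follows. This is an illustrative sketch only, not the actual `.qora-vision` storage layout: each group of 32 weights shares one scale, and values map to signed 4-bit integers so the group's largest magnitude lands on ±7.

```rust
/// Quantize one weight group to symmetric 4-bit ints with a shared scale.
/// Illustrative only; the real binary format is not documented here.
fn quantize_group_q4(group: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = group.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 7.0 } else { 1.0 };
    let q = group
        .iter()
        .map(|&x| (x / scale).round().clamp(-8.0, 7.0) as i8)
        .collect();
    (scale, q)
}

/// Reverse mapping: multiply each 4-bit value by the group's scale.
fn dequantize_group_q4(scale: f32, q: &[i8]) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

With group_size=32, each group costs 32 × 4 bits plus one scale, which is where the roughly 4.5 bits/weight footprint (58 MB vision, 151 MB text) comes from.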

## Architecture

### Vision Encoder (12-layer ViT-Base)

| Component | Details |
|-----------|---------|
| **Layers** | 12 transformer layers |
| **Hidden Size** | 768 |
| **Attention Heads** | 12 (head_dim=64) |
| **MLP (Intermediate)** | 3,072 (GELU-Tanh activation) |
| **Patch Size** | 16x16 (non-overlapping) |
| **Sequence Length** | 196 patches (14x14 grid) |
| **Normalization** | LayerNorm with bias (eps=1e-6) |
| **Attention** | Bidirectional (no causal mask) |
| **Position Encoding** | Learned position embeddings |
| **Pooling** | MAP (Multi-head Attention Pooling) |
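MAP pooling condenses the 196 patch tokens into one vector via cross-attention against a learned probe. A simplified single-query, single-head sketch (real MAP pooling uses 12 heads plus a LayerNorm and MLP; this shows only the probe-attention idea):

```rust
/// Attention-pool a sequence of token vectors with one learned probe vector.
/// Simplified: single head, no output projection.
fn attention_pool(probe: &[f32], tokens: &[Vec<f32>]) -> Vec<f32> {
    let d = probe.len() as f32;
    // Scaled dot-product scores between the probe and every token.
    let scores: Vec<f32> = tokens
        .iter()
        .map(|t| probe.iter().zip(t).map(|(a, b)| a * b).sum::<f32>() / d.sqrt())
        .collect();
    // Softmax over tokens (max-subtracted for stability).
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    // Weighted sum of token vectors.
    let mut out = vec![0.0f32; probe.len()];
    for (w, t) in exps.iter().zip(tokens) {
        for (o, x) in out.iter_mut().zip(t) {
            *o += (w / sum) * x;
        }
    }
    out
}
```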

### Text Encoder (12-layer Transformer)

| Component | Details |
|-----------|---------|
| **Layers** | 12 transformer layers |
| **Hidden Size** | 768 |
| **Vocabulary** | 256,000 tokens |
| **Max Position** | 64 tokens |
| **Pooling** | Last token + linear head |

### Contrastive Scoring

```
score = sigmoid(cosine_sim(image_embed, text_embed) * exp(logit_scale) + logit_bias)
```
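A minimal Rust sketch of this scoring, assuming both embeddings are already L2-normalized (so their dot product is the cosine similarity); `logit_scale` and `logit_bias` are the learned scalars stored with the model:

```rust
/// SigLIP sigmoid score for one image-text pair.
/// Assumes `image` and `text` are L2-normalized 768-dim embeddings.
fn siglip_score(image: &[f32], text: &[f32], logit_scale: f32, logit_bias: f32) -> f32 {
    let cos: f32 = image.iter().zip(text).map(|(a, b)| a * b).sum();
    let logit = cos * logit_scale.exp() + logit_bias;
    1.0 / (1.0 + (-logit).exp()) // sigmoid
}
```

Because each pair is scored independently through a sigmoid, scores need not sum to 1 across labels, unlike CLIP's softmax; this is why the test scores below are small absolute values rather than a probability distribution.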

## Pipeline

```
Image (224x224) → Patch Embedding (196 patches)
    → Add Position Embeddings
    → 12x ViT Transformer Layers (bidirectional)
    → Post-LayerNorm
    → MAP Pooling (cross-attention with learned probe)
    → L2 Normalize
    → 768-dim Image Embedding

Text → Tokenize → Token + Position Embedding
    → 12x Transformer Layers
    → Final LayerNorm (last token)
    → Linear Head
    → L2 Normalize
    → 768-dim Text Embedding

Score = sigmoid(cosine_sim * exp(scale) + bias)
```
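Two small pieces of the pipeline above can be pinned down exactly: the patch count (224/16 = 14, giving a 14x14 grid of 196 patches) and the final L2 normalization applied to both branches. Hypothetical helpers, not the engine's actual API:

```rust
/// Patches for a square image split into non-overlapping square patches.
fn patch_count(image_size: usize, patch_size: usize) -> usize {
    let grid = image_size / patch_size; // 224 / 16 = 14
    grid * grid                         // 14 * 14 = 196
}

/// In-place L2 normalization, applied to both embeddings before scoring.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}
```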

## Files

```
siglip-model/
  qora-vision.exe      - 4.4 MB    Inference engine
  model.qora-vision    - 210 MB    Full model (vision + text, Q4)
  tokenizer.json       - 33 MB     Text tokenizer (256K vocab)
  config.json          - 611 B     QORA-branded config
  README.md            - This file
```

## Usage

```bash
# Zero-shot classification (fast, from binary)
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --labels "cat,dog,bird,car"

# Image-text similarity
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --text "a photo of a sunset"

# Image embedding only
qora-vision.exe siglip --load model.qora-vision --image photo.jpg

# Load from safetensors (slow, first time)
qora-vision.exe siglip --model-path ../SigLIP2/ --image photo.jpg --labels "cat,dog,bird,car"

# Save binary for fast loading
qora-vision.exe siglip --model-path ../SigLIP2/ --save model.qora-vision
```

### CLI Arguments

| Flag | Default | Description |
|------|---------|-------------|
| `--model-path <path>` | `.` | Path to model directory (safetensors) |
| `--image <path>` | - | Input image (PNG/JPEG) |
| `--labels <list>` | - | Comma-separated labels for zero-shot |
| `--text <string>` | - | Text for similarity scoring |
| `--load <path>` | - | Load binary (.qora-vision, includes vision + text) |
| `--save <path>` | - | Save full model binary (vision + text + scale/bias) |
| `--f16` | off | Use F16 weights instead of Q4 |

## Published Benchmarks

### SigLIP 2 Base (224px) - Published Scores

| Benchmark | Score |
|-----------|-------|
| **ImageNet-1K Zero-shot** | ~69.8% |
| **Multilingual support** | Yes (trained on WebLI) |

SigLIP 2 improves over the original SigLIP with enhanced semantic understanding, localization, and dense features. Its sigmoid loss yields better-calibrated per-pair scores than CLIP's softmax-based contrastive approach.

### Model Comparison

| Model | Params | Image Size | Architecture | Zero-shot ImageNet |
|-------|--------|------------|-------------|-------------------|
| **QORA-Vision (SigLIP 2 Base)** | 93M | 224 | ViT-B/16 | ~69.8% |
| CLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 68.3% |
| SigLIP Base (v1) | 86M | 224 | ViT-B/16 | 66.2% |
| OpenCLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 67.0% |

## Test Results

All tests run with Q4 quantization on CPU.

### Test 1: Red Image Classification

**Input:** Solid red 224x224 image
**Labels:** red, blue, green, yellow

| Label | Score |
|-------|-------|
| **red** | **0.0022** |
| blue | 0.0000 |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified "red") |
| Vision Forward | 42.0s |
| Embedding Dim | 768, L2 norm = 1.0000 |

### Test 2: Blue Image Classification

**Input:** Solid blue 224x224 image
**Labels:** red, blue, green, yellow

| Label | Score |
|-------|-------|
| red | 0.0000 |
| **blue** | **0.0014** |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified "blue") |
| Vision Forward | 31.5s |

### Test 3: Green Image with Natural Language Labels

**Input:** Solid green 224x224 image
**Labels:** "a photo of a cat", "a photo of a dog", "a solid green image", "a landscape"

| Label | Score |
|-------|-------|
| a photo of a cat | 0.0000 |
| a photo of a dog | 0.0000 |
| **a solid green image** | **0.0176** |
| a landscape | 0.0000 |

| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified natural language description) |
| Vision Forward | 39.2s |
| Note | Highest score by far, demonstrating text understanding |

### Test Summary

| Test | Input | Best Label | Correct? | Score |
|------|-------|------------|----------|-------|
| Color (red) | Solid red | "red" | PASS | 0.0022 |
| Color (blue) | Solid blue | "blue" | PASS | 0.0014 |
| NL Description | Solid green | "a solid green image" | PASS | 0.0176 |
| **Overall** | | | **3/3 (100%)** | |

## Performance

| Metric | Value |
|--------|-------|
| **Binary Load** | ~115ms (full model, 210 MB) |
| **Safetensors Load** | ~11-20s (first run, before binary save) |
| **Vision Forward** | ~13-20s (196 tokens, 12 layers) |
| **Text Forward** | ~5s per label |
| **Total (4 labels)** | ~33-55s |
| **Memory (Vision Q4)** | 58 MB |
| **Memory (Text Q4)** | 151 MB |
| **Binary Save** | ~2s (210 MB) |

## QORA Model Family

| Engine | Model | Params | Size (Q4) | Purpose |
|--------|-------|--------|-----------|---------|
| **QORA** | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat |
| **QORA-TTS** | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis |
| **QORA-Vision (Image)** | SigLIP 2 Base | 93M | 58 MB | Image embeddings, zero-shot classification |
| **QORA-Vision (Video)** | ViViT Base | 89M | 60 MB | Video action classification |

---

*Built with QORA - Pure Rust AI Inference*