QORA-Vision (Image) - Native Rust Image Encoder
A pure-Rust image-understanding engine based on SigLIP 2, providing zero-shot image classification, image embeddings, and image-text similarity scoring. No Python runtime, no CUDA, no external dependencies.
Overview
| Property | Value |
|---|---|
| Engine | QORA-Vision (Pure Rust) |
| Base Model | SigLIP 2 Base (`google/siglip2-base-patch16-224`) |
| Vision Params | ~93M |
| Text Params | ~283M (256K vocab) |
| Quantization | Q4 (4-bit symmetric, group_size=32) |
| Model Size | 210 MB (Q4 binary, vision + text) |
| Executable | 4.4 MB |
| Input | 224x224 RGB images (PNG/JPEG) |
| Output | 768-dim embeddings + zero-shot classification scores |
| Platform | Windows x86_64 (CPU-only) |
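The Q4 scheme above (4-bit symmetric, group_size=32) can be sketched roughly as follows. This is a minimal illustration: the function names, the unpacked `i8` storage, and the on-disk layout of `model.qora-vision` are assumptions, not QORA's actual format.

```rust
/// Quantize to signed 4-bit values with one f32 scale per group of 32 weights.
/// Symmetric: the max magnitude in each group maps to the int4 range [-7, 7].
fn quantize_q4(weights: &[f32]) -> (Vec<i8>, Vec<f32>) {
    const GROUP_SIZE: usize = 32;
    let mut quants = Vec::with_capacity(weights.len());
    let mut scales = Vec::new();
    for group in weights.chunks(GROUP_SIZE) {
        let max_abs = group.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
        let scale = if max_abs > 0.0 { max_abs / 7.0 } else { 1.0 };
        scales.push(scale);
        for &w in group {
            quants.push((w / scale).round().clamp(-7.0, 7.0) as i8);
        }
    }
    (quants, scales)
}

/// Reconstruct approximate f32 weights: q * scale, group by group.
fn dequantize_q4(quants: &[i8], scales: &[f32]) -> Vec<f32> {
    quants
        .chunks(32)
        .zip(scales)
        .flat_map(|(g, &s)| g.iter().map(move |&q| q as f32 * s))
        .collect()
}

fn main() {
    let w = vec![0.7f32, -0.35, 0.1, 0.0];
    let (q, s) = quantize_q4(&w);
    let back = dequantize_q4(&q, &s);
    // Each reconstructed weight is within half a quantization step of the original.
    for (a, b) in w.iter().zip(&back) {
        assert!((a - b).abs() <= s[0] / 2.0 + 1e-6);
    }
}
```

A real implementation would additionally pack two 4-bit values per byte, which is where the ~8x size reduction over f32 comes from.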
Architecture
Vision Encoder (12-layer ViT-Base)
| Component | Details |
|---|---|
| Layers | 12 transformer layers |
| Hidden Size | 768 |
| Attention Heads | 12 (head_dim=64) |
| MLP (Intermediate) | 3,072 (GELU-Tanh activation) |
| Patch Size | 16x16 (non-overlapping) |
| Sequence Length | 196 patches (14x14 grid) |
| Normalization | LayerNorm with bias (eps=1e-6) |
| Attention | Bidirectional (no causal mask) |
| Position Encoding | Learned position embeddings |
| Pooling | MAP (Multi-head Attention Pooling) |
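MAP pooling condenses the 196 patch tokens into a single vector by cross-attending with a learned probe. A single-head sketch, with the real layer's multiple heads and Q/K/V/output projections omitted as an assumption for brevity:

```rust
/// One learned probe vector attends over all patch tokens; the pooled output
/// is the attention-weighted average of the tokens (single-head sketch).
fn map_pool(probe: &[f32], tokens: &[Vec<f32>]) -> Vec<f32> {
    let d = probe.len();
    let scale = 1.0 / (d as f32).sqrt();
    // Attention logits: scaled dot product of the probe against every token.
    let logits: Vec<f32> = tokens
        .iter()
        .map(|t| probe.iter().zip(t).map(|(q, k)| q * k).sum::<f32>() * scale)
        .collect();
    // Softmax over the token axis (max-subtracted for numerical stability).
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    // Weighted average of the tokens.
    let mut out = vec![0.0f32; d];
    for (w, t) in exps.iter().zip(tokens) {
        for (o, &v) in out.iter_mut().zip(t) {
            *o += (w / sum) * v;
        }
    }
    out
}

fn main() {
    // Toy 2-dim example: the probe is aligned with the first token,
    // so that token dominates the pooled output.
    let pooled = map_pool(&[1.0, 0.0], &[vec![1.0, 0.0], vec![0.0, 1.0]]);
    assert_eq!(pooled.len(), 2);
    assert!(pooled[0] > pooled[1]);
}
```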
Text Encoder (12-layer Transformer)
| Component | Details |
|---|---|
| Layers | 12 transformer layers |
| Hidden Size | 768 |
| Vocabulary | 256,000 tokens |
| Max Position | 64 tokens |
| Pooling | Last token + linear head |
Contrastive Scoring
```
score = sigmoid(cosine_sim(image_embed, text_embed) * exp(logit_scale) + logit_bias)
```
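A minimal Rust sketch of this scoring step, assuming both embeddings are already L2-normalized. The `logit_scale`/`logit_bias` values below are representative magnitudes (the SigLIP paper's initialization), not the learned values shipped with the model:

```rust
/// For unit-norm vectors, the dot product equals the cosine similarity.
fn cosine_sim(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// SigLIP score: sigmoid of the scaled, biased cosine similarity.
fn siglip_score(image: &[f32], text: &[f32], logit_scale: f32, logit_bias: f32) -> f32 {
    let logit = cosine_sim(image, text) * logit_scale.exp() + logit_bias;
    1.0 / (1.0 + (-logit).exp()) // sigmoid
}

fn main() {
    // Toy 2-dim unit vectors in place of real 768-dim embeddings.
    let scale = 10.0f32.ln(); // exp(scale) = 10; illustrative, not the learned value
    let bias = -10.0;
    let img = [1.0f32, 0.0];
    let matching = [1.0f32, 0.0];
    let unrelated = [0.0f32, 1.0];
    let hi = siglip_score(&img, &matching, scale, bias);
    let lo = siglip_score(&img, &unrelated, scale, bias);
    assert!(hi > lo);
    assert!(lo < 0.001); // non-matching pairs score near zero
}
```

The large negative bias is why even correct labels can receive small absolute scores, as in the test results below; what matters for classification is the ranking across labels.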
Pipeline
```
Image (224x224) → Patch Embedding (196 patches)
    → Add Position Embeddings
    → 12x ViT Transformer Layers (bidirectional)
    → Post-LayerNorm
    → MAP Pooling (cross-attention with learned probe)
    → L2 Normalize
    → 768-dim Image Embedding

Text → Tokenize → Token + Position Embeddings
    → 12x Transformer Layers
    → Final LayerNorm (last token)
    → Linear Head
    → L2 Normalize
    → 768-dim Text Embedding

Score = sigmoid(cosine_sim * exp(scale) + bias)
```
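The patchify step at the top of the vision pipeline can be sketched as follows. The HWC pixel layout is an assumption, and the learned linear projection that turns each flattened patch into an embedding is omitted:

```rust
const IMG: usize = 224; // input resolution
const PATCH: usize = 16; // patch side length
const GRID: usize = IMG / PATCH; // 14x14 grid of patches

/// Split a 224x224 RGB image (HWC f32 layout) into 196 non-overlapping
/// 16x16 patches, each flattened to 16 * 16 * 3 = 768 values.
fn patchify(pixels: &[f32]) -> Vec<Vec<f32>> {
    assert_eq!(pixels.len(), IMG * IMG * 3);
    let mut patches = Vec::with_capacity(GRID * GRID);
    for py in 0..GRID {
        for px in 0..GRID {
            let mut patch = Vec::with_capacity(PATCH * PATCH * 3);
            for y in 0..PATCH {
                for x in 0..PATCH {
                    let row = py * PATCH + y;
                    let col = px * PATCH + x;
                    let base = (row * IMG + col) * 3; // RGB triple at (row, col)
                    patch.extend_from_slice(&pixels[base..base + 3]);
                }
            }
            patches.push(patch);
        }
    }
    patches
}

fn main() {
    let img = vec![0.0f32; IMG * IMG * 3];
    let p = patchify(&img);
    assert_eq!(p.len(), 196); // 14x14 grid → sequence length 196
    assert_eq!(p[0].len(), 768); // 16*16*3 values per patch
}
```

It is a coincidence of this configuration that the flattened patch size (768) equals the hidden size; the projection between them is still a learned 768x768 matrix.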
Files
```
siglip-model/
├── qora-vision.exe      4.4 MB   Inference engine
├── model.qora-vision    210 MB   Full model (vision + text, Q4)
├── tokenizer.json       33 MB    Text tokenizer (256K vocab)
├── config.json          611 B    QORA-branded config
└── README.md                     This file
```
Usage
```
# Zero-shot classification over comma-separated labels
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --labels "cat,dog,bird,car"

# Image-text similarity score for a single caption
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --text "a photo of a sunset"

# Image embedding only (768-dim, L2-normalized)
qora-vision.exe siglip --load model.qora-vision --image photo.jpg

# Run directly from a safetensors model directory
qora-vision.exe siglip --model-path ../SigLIP2/ --image photo.jpg --labels "cat,dog,bird,car"

# Convert safetensors to the Q4 binary format
qora-vision.exe siglip --model-path ../SigLIP2/ --save model.qora-vision
```
CLI Arguments
| Flag | Default | Description |
|---|---|---|
| `--model-path <path>` | `.` | Path to model directory (safetensors) |
| `--image <path>` | - | Input image (PNG/JPEG) |
| `--labels <list>` | - | Comma-separated labels for zero-shot classification |
| `--text <string>` | - | Text for similarity scoring |
| `--load <path>` | - | Load binary (.qora-vision, includes vision + text) |
| `--save <path>` | - | Save full model binary (vision + text + scale/bias) |
| `--f16` | off | Use F16 weights instead of Q4 |
Published Benchmarks
SigLIP 2 Base (224px) - Published Scores
| Benchmark | Score |
|---|---|
| ImageNet-1K Zero-shot | ~69.8% |
| Multilingual support | Yes (trained on WebLI) |
SigLIP 2 improves over the original SigLIP with enhanced semantic understanding, localization, and dense features. The sigmoid loss also yields better-calibrated scores than CLIP's softmax-based approach.
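A small illustration of that calibration difference: sigmoid scores each label independently, so all scores can be low when nothing fits, whereas softmax always forces a distribution that sums to 1 over the provided labels. The logit values here are made up for illustration:

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Numerically stable softmax over a slice of logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    // Logits for labels that all match the image poorly.
    let logits = [-6.0f32, -7.0, -8.0];

    // Sigmoid: every score is tiny, signalling "none of these fit well".
    let sig: Vec<f32> = logits.iter().map(|&x| sigmoid(x)).collect();
    assert!(sig.iter().all(|&s| s < 0.01));

    // Softmax: scores are forced to sum to 1, so the least-bad label
    // looks confident even though nothing actually matches.
    let soft = softmax(&logits);
    assert!((soft.iter().sum::<f32>() - 1.0).abs() < 1e-5);
    assert!(soft[0] > 0.5);
}
```

This is why the test scores below are small in absolute terms (e.g. 0.0022) yet still unambiguous: the correct label's sigmoid score is orders of magnitude above the rest.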
Model Comparison
| Model | Params | Image Size | Architecture | Zero-shot ImageNet |
|---|---|---|---|---|
| QORA-Vision (SigLIP 2 Base) | 93M | 224 | ViT-B/16 | ~69.8% |
| CLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 68.3% |
| SigLIP Base (v1) | 86M | 224 | ViT-B/16 | 66.2% |
| OpenCLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 67.0% |
Test Results
All tests run with Q4 quantization on CPU.
Test 1: Red Image Classification
Input: Solid red 224x224 image
Labels: red, blue, green, yellow
| Label | Score |
|---|---|
| red | 0.0022 |
| blue | 0.0000 |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|---|---|
| Result | PASS (correctly identified "red") |
| Vision Forward | 42.0s |
| Embedding Dim | 768, L2 norm = 1.0000 |
Test 2: Blue Image Classification
Input: Solid blue 224x224 image
Labels: red, blue, green, yellow
| Label | Score |
|---|---|
| red | 0.0000 |
| blue | 0.0014 |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|---|---|
| Result | PASS (correctly identified "blue") |
| Vision Forward | 31.5s |
Test 3: Green Image with Natural Language Labels
Input: Solid green 224x224 image
Labels: "a photo of a cat", "a photo of a dog", "a solid green image", "a landscape"
| Label | Score |
|---|---|
| a photo of a cat | 0.0000 |
| a photo of a dog | 0.0000 |
| a solid green image | 0.0176 |
| a landscape | 0.0000 |

| Metric | Value |
|---|---|
| Result | PASS (correctly identified natural language description) |
| Vision Forward | 39.2s |
| Note | Highest score by far, demonstrating text understanding |
Test Summary
| Test | Input | Best Label | Correct? | Score |
|---|---|---|---|---|
| Color (red) | Solid red | "red" | PASS | 0.0022 |
| Color (blue) | Solid blue | "blue" | PASS | 0.0014 |
| NL Description | Solid green | "a solid green image" | PASS | 0.0176 |
| Overall | | | 3/3 (100%) | |
Performance
| Metric | Value |
|---|---|
| Binary Load | ~115ms (full model, 210 MB) |
| Safetensors Load | ~11-20s |
| Vision Forward | ~13-20s (196 tokens, 12 layers) |
| Text Forward | ~5s per label |
| Total (4 labels) | ~33-55s |
| Memory (Vision Q4) | 58 MB |
| Memory (Text Q4) | 151 MB |
| Binary Save | ~2s (210 MB) |
QORA Model Family
| Engine | Model | Params | Size (Q4) | Purpose |
|---|---|---|---|---|
| QORA | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat |
| QORA-TTS | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis |
| QORA-Vision (Image) | SigLIP 2 Base | 93M | 58 MB | Image embeddings, zero-shot classification |
| QORA-Vision (Video) | ViViT Base | 89M | 60 MB | Video action classification |
Built with QORA - Pure Rust AI Inference