Pure Rust image understanding engine based on SigLIP 2. Zero-shot image classification, image embeddings, and image-text similarity. No Python runtime, no CUDA, no external dependencies.
```
Image (224x224) → Patch Embedding (196 patches)
  → Add Position Embeddings
  → 12x ViT Transformer Layers (bidirectional)
  → Post-LayerNorm
  → MAP Pooling (cross-attention with learned probe)
  → L2 Normalize
  → 768-dim Image Embedding
```
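A rough shape check for the vision tower above, assuming ViT-B/16 geometry (16x16-pixel patches, 768-dim embeddings, as listed in the model comparison table):

```rust
// Shape check for the vision tower. Assumes ViT-B/16 geometry
// (16x16-pixel patches, 768-dim hidden size); a sketch, not engine code.
fn main() {
    let (image_size, patch_size, hidden_dim) = (224usize, 16usize, 768usize);
    let per_side = image_size / patch_size; // 224 / 16 = 14
    let num_patches = per_side * per_side;  // 14 * 14 = 196 patches
    assert_eq!(num_patches, 196);
    // MAP pooling then collapses the 196 patch vectors into a single
    // hidden_dim vector, which is L2-normalized.
    println!("{num_patches} patches -> {hidden_dim}-dim embedding");
}
```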
```
Text → Tokenize → Token + Position Embedding
  → 12x ViT Transformer Layers
  → Final LayerNorm (last token)
  → Linear Head
  → L2 Normalize
  → 768-dim Text Embedding
```
```
Score = sigmoid(cosine_sim * exp(scale) + bias)
```
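A minimal sketch of this scoring formula: both embeddings are L2-normalized, so their dot product is the cosine similarity. The `scale` and `bias` values below are illustrative; the real ones are learned parameters stored in the model.

```rust
// Sketch of Score = sigmoid(cosine_sim * exp(scale) + bias).
// scale/bias values here are made up for illustration; the engine
// loads the learned values from the model binary.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter_mut().for_each(|x| *x /= norm);
}

fn score(img: &[f32], txt: &[f32], scale: f32, bias: f32) -> f32 {
    // Dot product of unit vectors == cosine similarity.
    let cos: f32 = img.iter().zip(txt).map(|(a, b)| a * b).sum();
    let logit = cos * scale.exp() + bias;
    1.0 / (1.0 + (-logit).exp()) // sigmoid
}

fn main() {
    // Toy 3-dim embeddings stand in for the real 768-dim ones.
    let (mut img, mut txt) = (vec![1.0f32, 2.0, 2.0], vec![2.0f32, 1.0, 2.0]);
    l2_normalize(&mut img);
    l2_normalize(&mut txt);
    let s = score(&img, &txt, 1.0, -10.0); // illustrative scale/bias
    assert!(s > 0.0 && s < 1.0);
    println!("score = {s:.4}");
}
```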
## Files

```
siglip-model/
  qora-vision.exe       4.4 MB   Inference engine
  model.qora-vision     210 MB   Full model (vision + text, Q4)
  tokenizer.json        33 MB    Text tokenizer (256K vocab)
  config.json           611 B    QORA-branded config
  README.md                      This file
```
## Usage

```shell
# Zero-shot classification (fast, from binary)
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --labels "cat,dog,bird,car"

# Image-text similarity
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --text "a photo of a sunset"

# Image embedding only
qora-vision.exe siglip --load model.qora-vision --image photo.jpg

# Load from safetensors (slow, first time)
qora-vision.exe siglip --model-path ../SigLIP2/ --image photo.jpg --labels "cat,dog,bird,car"

# Save binary for fast loading
qora-vision.exe siglip --model-path ../SigLIP2/ --save model.qora-vision
```
## CLI Arguments

| Flag | Default | Description |
|------|---------|-------------|
| `--model-path <path>` | `.` | Path to model directory (safetensors) |
| `--image <path>` | - | Input image (PNG/JPEG) |
| `--labels <list>` | - | Comma-separated labels for zero-shot classification |
| `--text <string>` | - | Text for similarity scoring |
| `--load <path>` | - | Load binary (`.qora-vision`, includes vision + text) |
| `--save <path>` | - | Save full model binary (vision + text + scale/bias) |
| `--f16` | off | Use F16 weights instead of Q4 |
## Published Benchmarks

SigLIP 2 Base (224px), published scores:

| Benchmark | Score |
|-----------|-------|
| ImageNet-1K zero-shot | ~69.8% |
| Multilingual support | Yes (trained on WebLI) |
SigLIP 2 improves over the original SigLIP with enhanced semantic understanding, localization, and dense features. Its sigmoid loss yields better-calibrated scores than CLIP's softmax-based approach, because each image-text pair is scored independently rather than competing against the other candidates.
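An illustrative contrast (toy logits, not from the source): given two equally plausible labels, per-label sigmoid keeps both scores high, while softmax must split probability mass between them.

```rust
// Sigmoid scores each label independently; softmax couples them.
fn main() {
    let logits = [2.0f64, 2.0, -1.0]; // two equally good labels, one bad
    let sigmoid: Vec<f64> = logits.iter().map(|z| 1.0 / (1.0 + (-z).exp())).collect();
    let exps: Vec<f64> = logits.iter().map(|z| z.exp()).collect();
    let sum: f64 = exps.iter().sum();
    let softmax: Vec<f64> = exps.iter().map(|e| e / sum).collect();
    // Both good labels keep a high independent score under sigmoid...
    assert!(sigmoid[0] > 0.85 && sigmoid[1] > 0.85);
    // ...but softmax forces them to share probability mass.
    assert!(softmax[0] < 0.5 && softmax[1] < 0.5);
    println!("sigmoid: {sigmoid:?}");
    println!("softmax: {softmax:?}");
}
```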
## Model Comparison

| Model | Params | Image Size | Architecture | Zero-shot ImageNet |
|-------|--------|------------|--------------|--------------------|
| QORA-Vision (SigLIP 2 Base) | 93M | 224 | ViT-B/16 | ~69.8% |
| CLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 68.3% |
| SigLIP Base (v1) | 86M | 224 | ViT-B/16 | 66.2% |
| OpenCLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 67.0% |
## Test Results

All tests were run with Q4 quantization on CPU.
### Test 1: Red Image Classification

- Input: solid red 224x224 image
- Labels: red, blue, green, yellow

| Label | Score |
|-------|-------|
| red | 0.0022 |
| blue | 0.0000 |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified "red") |
| Vision Forward | 42.0s |
| Embedding Dim | 768, L2 norm = 1.0000 |
### Test 2: Blue Image Classification

- Input: solid blue 224x224 image
- Labels: red, blue, green, yellow

| Label | Score |
|-------|-------|
| red | 0.0000 |
| blue | 0.0014 |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified "blue") |
| Vision Forward | 31.5s |
### Test 3: Green Image with Natural Language Labels

- Input: solid green 224x224 image
- Labels: "a photo of a cat", "a photo of a dog", "a solid green image", "a landscape"

| Label | Score |
|-------|-------|
| a photo of a cat | 0.0000 |
| a photo of a dog | 0.0000 |
| a solid green image | 0.0176 |
| a landscape | 0.0000 |

| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified natural language description) |
| Vision Forward | 39.2s |
| Note | Highest score by far, demonstrating text understanding |