QORA-Vision (Image) - Native Rust Image Encoder

Pure Rust image understanding engine based on SigLIP 2. Zero-shot image classification, image embeddings, and image-text similarity. No Python runtime, no CUDA, no external dependencies.

Overview

Property	Value
Engine	QORA-Vision (Pure Rust)
Base Model	SigLIP 2 Base (google/siglip2-base-patch16-224)
Vision Params	~93M
Text Params	~283M (256K vocab)
Quantization	Q4 (4-bit symmetric, group_size=32)
Model Size	210 MB (Q4 binary, vision + text)
Executable	4.4 MB
Input	224x224 RGB images (PNG/JPEG)
Output	768-dim embeddings + zero-shot classification scores
Platform	Windows x86_64 (CPU-only)
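
The Q4 row above can be made concrete with a minimal sketch of symmetric 4-bit quantization over groups of 32 weights. The struct layout, the -7..=7 integer range, and the f32 scale are assumptions for illustration; the actual layout of the `.qora-vision` binary is not documented here.

```rust
/// Number of weights that share one scale (group_size=32 in the table above).
const GROUP_SIZE: usize = 32;

/// One quantized group: a single scale plus 32 signed 4-bit values.
/// Values are stored one per i8 for clarity; a real format would likely
/// pack two values per byte (assumption).
pub struct Q4Group {
    pub scale: f32,
    pub values: [i8; GROUP_SIZE],
}

/// Symmetric quantization: the largest magnitude in the group maps to +/-7,
/// and zero always maps to zero (no zero-point offset).
pub fn quantize_group(weights: &[f32; GROUP_SIZE]) -> Q4Group {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // Avoid dividing by zero for an all-zero group.
    let scale = if max_abs > 0.0 { max_abs / 7.0 } else { 1.0 };
    let mut values = [0i8; GROUP_SIZE];
    for (q, w) in values.iter_mut().zip(weights.iter()) {
        *q = (w / scale).round().clamp(-7.0, 7.0) as i8;
    }
    Q4Group { scale, values }
}

/// Reconstruct approximate weights: w ~= q * scale.
pub fn dequantize_group(group: &Q4Group) -> [f32; GROUP_SIZE] {
    let mut out = [0.0f32; GROUP_SIZE];
    for (o, q) in out.iter_mut().zip(group.values.iter()) {
        *o = *q as f32 * group.scale;
    }
    out
}
```

With this scheme the round-trip error of each weight is at most half a quantization step (`scale / 2`), which is why smaller groups generally preserve accuracy better at the cost of storing more scales.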

Architecture

Vision Encoder (12-layer ViT-Base)

Component	Details
Layers	12 transformer layers
Hidden Size	768
Attention Heads	12 (head_dim=64)
MLP (Intermediate)	3,072 (GELU-Tanh activation)
Patch Size	16x16 (non-overlapping)
Sequence Length	196 patches (14x14 grid)
Normalization	LayerNorm with bias (eps=1e-6)
Attention	Bidirectional (no causal mask)
Position Encoding	Learned position embeddings
Pooling	MAP (Multi-head Attention Pooling)
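
For reference, the GELU-Tanh activation named in the table is the standard tanh approximation of GELU. A minimal sketch follows; the function name is illustrative and not taken from the engine's source.

```rust
/// Tanh approximation of GELU:
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
pub fn gelu_tanh(x: f32) -> f32 {
    // sqrt(2 / pi), rounded to f32 precision.
    const SQRT_2_OVER_PI: f32 = 0.797_884_56;
    let inner = SQRT_2_OVER_PI * (x + 0.044_715 * x * x * x);
    0.5 * x * (1.0 + inner.tanh())
}
```

In a ViT MLP block this would be applied element-wise to the 3,072-wide intermediate activations before projecting back down to the 768-wide hidden size.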

Text Encoder (12-layer Transformer)

Component	Details
Layers	12 transformer layers
Hidden Size	768
Vocabulary	256,000 tokens
Max Position	64 tokens
Pooling	Last token + linear head

Contrastive Scoring

score = sigmoid(cosine_sim(image_embed, text_embed) * exp(logit_scale) + logit_bias)
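
The formula above can be sketched directly in Rust. The function names are hypothetical, and `logit_scale` and `logit_bias` stand for the two learned scalars stored with the model; the values passed in any example are placeholders, not the trained parameters.

```rust
/// Scale a vector to unit L2 norm (an all-zero vector is returned unchanged).
pub fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Cosine similarity: dot product of the two L2-normalized vectors.
pub fn cosine_sim(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have the same dimension");
    let a = l2_normalize(a);
    let b = l2_normalize(b);
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

/// Logistic sigmoid.
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// score = sigmoid(cosine_sim * exp(logit_scale) + logit_bias)
pub fn siglip_score(image: &[f32], text: &[f32], logit_scale: f32, logit_bias: f32) -> f32 {
    sigmoid(cosine_sim(image, text) * logit_scale.exp() + logit_bias)
}
```

Each image-text pair is scored on its own, so scores for different labels are not normalized against each other. A strongly negative `logit_bias` pushes every score toward zero, which is one reason absolute scores can be small even for the correct label.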

Pipeline

Image (224x224) → Patch Embedding (196 patches)
    → Add Position Embeddings
    → 12x ViT Transformer Layers (bidirectional)
    → Post-LayerNorm
    → MAP Pooling (cross-attention with learned probe)
    → L2 Normalize
    → 768-dim Image Embedding

Text → Tokenize → Token + Position Embedding
    → 12x Transformer Layers
    → Final LayerNorm (last token)
    → Linear Head
    → L2 Normalize
    → 768-dim Text Embedding

Score = sigmoid(cosine_sim * exp(scale) + bias)
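
The first pipeline step, turning a 224x224 image into 196 patches, can be sketched as follows. The interleaved row-major RGB layout and the name `extract_patches` are assumptions for illustration; the engine's actual preprocessing (resizing, pixel normalization, the learned patch projection) is not shown.

```rust
const CHANNELS: usize = 3; // RGB

/// Split a square image into non-overlapping square patches.
/// `pixels` is assumed to be row-major, interleaved RGB: [r, g, b, r, g, b, ...].
/// Returns (size / patch)^2 flattened patches, ordered left-to-right, top-to-bottom.
pub fn extract_patches(pixels: &[f32], size: usize, patch: usize) -> Vec<Vec<f32>> {
    assert_eq!(pixels.len(), size * size * CHANNELS, "unexpected pixel count");
    assert_eq!(size % patch, 0, "image size must be a multiple of the patch size");
    let grid = size / patch; // 224 / 16 = 14 patches per side
    let mut patches = Vec::with_capacity(grid * grid);
    for py in 0..grid {
        for px in 0..grid {
            let mut flat = Vec::with_capacity(patch * patch * CHANNELS);
            for y in 0..patch {
                // Copy one row of this patch (patch * CHANNELS contiguous values).
                let row = py * patch + y;
                let start = (row * size + px * patch) * CHANNELS;
                flat.extend_from_slice(&pixels[start..start + patch * CHANNELS]);
            }
            patches.push(flat);
        }
    }
    patches
}
```

For a 224x224 input with 16x16 patches this yields 14 x 14 = 196 patches of 16 x 16 x 3 = 768 values each, matching the sequence length listed for the vision encoder.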

Files

siglip-model/
  qora-vision.exe      - 4.4 MB    Inference engine
  model.qora-vision    - 210 MB    Full model (vision + text, Q4)
  tokenizer.json       - 33 MB     Text tokenizer (256K vocab)
  config.json          - 611 B     QORA-branded config
  README.md            - This file

Usage

# Zero-shot classification (fast, from binary)
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --labels "cat,dog,bird,car"

# Image-text similarity
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --text "a photo of a sunset"

# Image embedding only
qora-vision.exe siglip --load model.qora-vision --image photo.jpg

# Load from safetensors (slow, first time)
qora-vision.exe siglip --model-path ../SigLIP2/ --image photo.jpg --labels "cat,dog,bird,car"

# Save binary for fast loading
qora-vision.exe siglip --model-path ../SigLIP2/ --save model.qora-vision

CLI Arguments

Flag	Default	Description
`--model-path <path>`	`.`	Path to model directory (safetensors)
`--image <path>`	-	Input image (PNG/JPEG)
`--labels <list>`	-	Comma-separated labels for zero-shot
`--text <string>`	-	Text for similarity scoring
`--load <path>`	-	Load binary (.qora-vision, includes vision + text)
`--save <path>`	-	Save full model binary (vision + text + scale/bias)
`--f16`	off	Use F16 weights instead of Q4
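
As an illustration of the `--labels` format, the sketch below splits a comma-separated list into individual labels. Trimming whitespace and dropping empty entries are assumptions; the engine's actual parsing rules are not documented here, and `parse_labels` is a hypothetical name.

```rust
/// Split a comma-separated label list such as "cat,dog,bird,car".
/// Surrounding whitespace is trimmed and empty entries are skipped (assumed behavior).
pub fn parse_labels(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}
```

Under this reading a label may contain spaces (for example "a photo of a cat") but not a comma, since the comma is the separator.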

Published Benchmarks

SigLIP 2 Base (224px) - Published Scores

Benchmark	Score
ImageNet-1K Zero-shot	~69.8%
Multilingual support	Yes (trained on WebLI)

SigLIP 2 improves over the original SigLIP with enhanced semantic understanding, localization, and dense features. Its sigmoid loss scores each image-text pair independently, so scores are not normalized across labels (they need not sum to 1) and tend to be better calibrated than CLIP's softmax-based scores.

Model Comparison

Model	Params	Image Size	Architecture	Zero-shot ImageNet
QORA-Vision (SigLIP 2 Base)	93M	224	ViT-B/16	~69.8%
CLIP ViT-B/16	86M	224	ViT-B/16	68.3%
SigLIP Base (v1)	86M	224	ViT-B/16	66.2%
OpenCLIP ViT-B/16	86M	224	ViT-B/16	67.0%

Test Results

All tests run with Q4 quantization on CPU.

Test 1: Red Image Classification

Input: Solid red 224x224 image
Labels: red, blue, green, yellow

Label	Score
red	0.0022
blue	0.0000
green	0.0000
yellow	0.0000

Metric	Value
Result	PASS (correctly identified "red")
Vision Forward	42.0s
Embedding Dim	768, L2 norm = 1.0000

Test 2: Blue Image Classification

Input: Solid blue 224x224 image
Labels: red, blue, green, yellow

Label	Score
red	0.0000
blue	0.0014
green	0.0000
yellow	0.0000

Metric	Value
Result	PASS (correctly identified "blue")
Vision Forward	31.5s

Test 3: Green Image with Natural Language Labels

Input: Solid green 224x224 image
Labels: "a photo of a cat", "a photo of a dog", "a solid green image", "a landscape"

Label	Score
a photo of a cat	0.0000
a photo of a dog	0.0000
a solid green image	0.0176
a landscape	0.0000

Metric	Value
Result	PASS (correctly identified natural language description)
Vision Forward	39.2s
Note	Highest score by far, demonstrating text understanding

Test Summary

Test	Input	Best Label	Correct?	Score
Color (red)	Solid red	"red"	PASS	0.0022
Color (blue)	Solid blue	"blue"	PASS	0.0014
NL Description	Solid green	"a solid green image"	PASS	0.0176
Overall			3/3 (100%)
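
Since sigmoid scores are computed per label rather than normalized across labels, the predicted label is simply the one with the highest score, even when every absolute score is small. A minimal sketch with a hypothetical helper name:

```rust
/// Return the label with the highest score, or None if the inputs are empty.
/// If the slices differ in length, the extra entries of the longer one are ignored.
pub fn best_label<'a>(labels: &[&'a str], scores: &[f32]) -> Option<(&'a str, f32)> {
    labels
        .iter()
        .copied()
        .zip(scores.iter().copied())
        // total_cmp gives a total order on f32, so NaN cannot cause a panic here.
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

Applied to the Test 1 scores (0.0022, 0.0000, 0.0000, 0.0000), this selects "red", which is the criterion the PASS results above rely on.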

Performance

Metric	Value
Binary Load	~115ms (full model, 210 MB)
Safetensors Load	~11-20s (from safetensors)
Vision Forward	~13-20s (196 tokens, 12 layers)
Text Forward	~5s per label
Total (4 labels)	~33-55s
Memory (Vision Q4)	58 MB
Memory (Text Q4)	151 MB
Binary Save	~2s (210 MB)

QORA Model Family

Engine	Model	Params	Size (Q4)	Purpose
QORA	SmolLM3-3B	3.07B	1.68 GB	Text generation, reasoning, chat
QORA-TTS	Qwen3-TTS	1.84B	1.5 GB	Text-to-speech synthesis
QORA-Vision (Image)	SigLIP 2 Base	93M	58 MB (vision encoder only; 210 MB with text encoder)	Image embeddings, zero-shot classification
QORA-Vision (Video)	ViViT Base	89M	60 MB	Video action classification

Built with QORA - Pure Rust AI Inference
