QORA-Vision (Image) - Native Rust Image Encoder

Pure Rust image understanding engine based on SigLIP 2. Zero-shot image classification, image embeddings, and image-text similarity. No Python runtime, no CUDA, no external dependencies.

Overview

| Property | Value |
|---|---|
| Engine | QORA-Vision (Pure Rust) |
| Base Model | SigLIP 2 Base (google/siglip2-base-patch16-224) |
| Vision Params | ~93M |
| Text Params | ~283M (256K vocab) |
| Quantization | Q4 (4-bit symmetric, group_size=32) |
| Model Size | 210 MB (Q4 binary, vision + text) |
| Executable | 4.4 MB |
| Input | 224x224 RGB images (PNG/JPEG) |
| Output | 768-dim embeddings + zero-shot classification scores |
| Platform | Windows x86_64 (CPU-only) |
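To make "4-bit symmetric, group_size=32" concrete, here is a minimal sketch of symmetric group quantization. It is an illustration only: the signed range [-8, 7], the f32 scale per group, and the function names are assumptions, and the actual on-disk layout of `model.qora-vision` (including how two 4-bit values are packed into a byte) is not shown here.

```rust
/// Number of weights that share one scale (matches group_size=32 above).
const GROUP_SIZE: usize = 32;

/// Quantize one group of weights to signed 4-bit values in [-8, 7]
/// with a single f32 scale. Symmetric means there is no zero point.
/// Assumption: the largest magnitude in the group maps onto 7.
fn quantize_group(weights: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // Guard against an all-zero group, where any scale would do.
    let scale = if max_abs > 0.0 { max_abs / 7.0 } else { 1.0 };
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-8.0, 7.0) as i8)
        .collect();
    (scale, q)
}

/// Reconstruct approximate f32 weights from one quantized group.
fn dequantize_group(scale: f32, q: &[i8]) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

/// Quantize a full tensor group by group: one (scale, values) pair per group.
fn quantize(weights: &[f32]) -> Vec<(f32, Vec<i8>)> {
    weights.chunks(GROUP_SIZE).map(quantize_group).collect()
}
```

With this scheme the round-trip error of each weight is at most half the group's scale, which is why smaller groups generally preserve accuracy better at the cost of storing more scales.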

Architecture

Vision Encoder (12-layer ViT-Base)

| Component | Details |
|---|---|
| Layers | 12 transformer layers |
| Hidden Size | 768 |
| Attention Heads | 12 (head_dim=64) |
| MLP (Intermediate) | 3,072 (GELU-Tanh activation) |
| Patch Size | 16x16 (non-overlapping) |
| Sequence Length | 196 patches (14x14 grid) |
| Normalization | LayerNorm with bias (eps=1e-6) |
| Attention | Bidirectional (no causal mask) |
| Position Encoding | Learned position embeddings |
| Pooling | MAP (Multi-head Attention Pooling) |
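The derived numbers in the table follow from the base dimensions: 224 / 16 = 14 patches per side, 14 x 14 = 196 patches, and 768 / 12 = 64 dimensions per head. The sketch below reproduces that arithmetic and shows the standard tanh approximation of GELU. The function names are hypothetical and are not taken from the QORA source.

```rust
/// Patches per side and in total for a square image split into
/// non-overlapping square patches.
fn patch_grid(image_size: usize, patch_size: usize) -> (usize, usize) {
    let side = image_size / patch_size;
    (side, side * side)
}

/// Per-head dimension when the hidden size is split evenly across heads.
fn head_dim(hidden_size: usize, num_heads: usize) -> usize {
    hidden_size / num_heads
}

/// Tanh approximation of GELU:
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn gelu_tanh(x: f32) -> f32 {
    let c = (2.0f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}
```

Calling `patch_grid(224, 16)` gives `(14, 196)` and `head_dim(768, 12)` gives `64`, matching the table.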

Text Encoder (12-layer Transformer)

| Component | Details |
|---|---|
| Layers | 12 transformer layers |
| Hidden Size | 768 |
| Vocabulary | 256,000 tokens |
| Max Position | 64 tokens |
| Pooling | Last token + linear head |
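"Last token + linear head" pooling can be sketched as follows: take the hidden state at the final sequence position, then apply one affine projection. This is a toy illustration using nested `Vec`s; the helper names and the row-major weight layout are assumptions, and the real engine presumably works on quantized 768-dim tensors.

```rust
/// Hidden state of the last token from a [seq_len][hidden] matrix.
/// Returns None for an empty sequence.
fn last_token(hidden: &[Vec<f32>]) -> Option<&Vec<f32>> {
    hidden.last()
}

/// Linear head: y = W x + b, with W stored as rows of length x.len().
fn linear(weight: &[Vec<f32>], bias: &[f32], x: &[f32]) -> Vec<f32> {
    weight
        .iter()
        .zip(bias)
        .map(|(row, b)| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>() + b)
        .collect()
}
```

For example, with hidden states `[[1, 2], [3, 4]]`, `last_token` selects `[3, 4]`, and an identity weight with bias `[0.5, -0.5]` maps it to `[3.5, 3.5]`.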

Contrastive Scoring

score = sigmoid(cosine_sim(image_embed, text_embed) * exp(logit_scale) + logit_bias)
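The scoring formula above translates directly into code. The sketch below is a plain f32 version with hypothetical function names; the `logit_scale` and `logit_bias` values passed in would come from the model file, and none are assumed here.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity: dot product divided by the product of L2 norms.
fn cosine_sim(a: &[f32], b: &[f32]) -> f32 {
    dot(a, b) / (dot(a, a).sqrt() * dot(b, b).sqrt())
}

/// Logistic sigmoid.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// score = sigmoid(cosine_sim(image, text) * exp(logit_scale) + logit_bias)
fn siglip_score(image: &[f32], text: &[f32], logit_scale: f32, logit_bias: f32) -> f32 {
    sigmoid(cosine_sim(image, text) * logit_scale.exp() + logit_bias)
}
```

Note that `logit_scale` is stored in log space, so it is exponentiated before multiplying. A negative `logit_bias` pushes the scores of unrelated pairs toward zero.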

Pipeline

Image (224x224) → Patch Embedding (196 patches)
    → Add Position Embeddings
    → 12x ViT Transformer Layers (bidirectional)
    → Post-LayerNorm
    → MAP Pooling (cross-attention with learned probe)
    → L2 Normalize
    → 768-dim Image Embedding

Text → Tokenize → Token + Position Embedding
    → 12x Transformer Layers
    → Final LayerNorm (last token)
    → Linear Head
    → L2 Normalize
    → 768-dim Text Embedding

Score = sigmoid(cosine_sim * exp(scale) + bias)
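Both branches of the pipeline end with an L2 normalization step. A minimal sketch, with a hypothetical function name and a zero-vector guard that is an assumption rather than documented engine behavior:

```rust
/// Scale a vector to unit L2 norm. A zero vector is returned unchanged
/// to avoid dividing by zero (an assumption of this sketch).
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}
```

Once both embeddings have unit norm, their cosine similarity is simply their dot product, so the final scoring step needs no further division by norms.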

Files

siglip-model/
  qora-vision.exe      - 4.4 MB    Inference engine
  model.qora-vision    - 210 MB    Full model (vision + text, Q4)
  tokenizer.json       - 33 MB     Text tokenizer (256K vocab)
  config.json          - 611 B     QORA-branded config
  README.md            - This file

Usage

# Zero-shot classification (fast, from binary)
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --labels "cat,dog,bird,car"

# Image-text similarity
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --text "a photo of a sunset"

# Image embedding only
qora-vision.exe siglip --load model.qora-vision --image photo.jpg

# Load from safetensors (slow, first time)
qora-vision.exe siglip --model-path ../SigLIP2/ --image photo.jpg --labels "cat,dog,bird,car"

# Save binary for fast loading
qora-vision.exe siglip --model-path ../SigLIP2/ --save model.qora-vision

CLI Arguments

| Flag | Default | Description |
|---|---|---|
| --model-path <path> | . | Path to model directory (safetensors) |
| --image <path> | - | Input image (PNG/JPEG) |
| --labels <list> | - | Comma-separated labels for zero-shot |
| --text <string> | - | Text for similarity scoring |
| --load <path> | - | Load binary (.qora-vision, includes vision + text) |
| --save <path> | - | Save full model binary (vision + text + scale/bias) |
| --f16 | off | Use F16 weights instead of Q4 |
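As an illustration of how a comma-separated `--labels` value could be turned into individual prompts, here is a small sketch. The function name is hypothetical, and whether the actual CLI trims whitespace or skips empty entries is an assumption, not documented behavior.

```rust
/// Split a comma-separated label list into trimmed, non-empty labels.
/// Assumption: surrounding whitespace and empty entries are dropped.
fn parse_labels(arg: &str) -> Vec<String> {
    arg.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}
```

With the example from the Usage section, `parse_labels("cat,dog,bird,car")` yields four labels, each of which would be encoded by the text encoder separately.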

Published Benchmarks

SigLIP 2 Base (224px) - Published Scores

| Benchmark | Score |
|---|---|
| ImageNet-1K Zero-shot | ~69.8% |
| Multilingual support | Yes (trained on WebLI) |

SigLIP 2 improves on the original SigLIP in semantic understanding, localization, and dense features. Because the sigmoid loss scores each image-text pair independently, its scores are better calibrated than those from CLIP's softmax-based approach.
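The practical difference between sigmoid and softmax scoring is easy to see in a small sketch. Sigmoid scores are computed per label and need not sum to 1, whereas softmax scores are normalized across the label set. This is also why a zero-shot result can have a small absolute score for the winning label and still rank it first. The function names and the example logits are illustrative assumptions.

```rust
/// Independent per-label probabilities (SigLIP-style scoring).
fn sigmoid_scores(logits: &[f32]) -> Vec<f32> {
    logits.iter().map(|x| 1.0 / (1.0 + (-x).exp())).collect()
}

/// Probabilities normalized across all labels (CLIP-style scoring).
fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtract the maximum for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

For logits such as `[-6.0, -12.0, -13.0]`, both functions rank the first label highest, but the sigmoid scores are all well below 0.01 while the softmax scores sum to 1.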

Model Comparison

| Model | Params | Image Size | Architecture | Zero-shot ImageNet |
|---|---|---|---|---|
| QORA-Vision (SigLIP 2 Base) | 93M | 224 | ViT-B/16 | ~69.8% |
| CLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 68.3% |
| SigLIP Base (v1) | 86M | 224 | ViT-B/16 | 66.2% |
| OpenCLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 67.0% |

Test Results

All tests run with Q4 quantization on CPU.

Test 1: Red Image Classification

Input: Solid red 224x224 image
Labels: red, blue, green, yellow

| Label | Score |
|---|---|
| red | 0.0022 |
| blue | 0.0000 |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|---|---|
| Result | PASS (correctly identified "red") |
| Vision Forward | 42.0s |
| Embedding Dim | 768, L2 norm = 1.0000 |

Test 2: Blue Image Classification

Input: Solid blue 224x224 image
Labels: red, blue, green, yellow

| Label | Score |
|---|---|
| red | 0.0000 |
| blue | 0.0014 |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|---|---|
| Result | PASS (correctly identified "blue") |
| Vision Forward | 31.5s |

Test 3: Green Image with Natural Language Labels

Input: Solid green 224x224 image
Labels: "a photo of a cat", "a photo of a dog", "a solid green image", "a landscape"

| Label | Score |
|---|---|
| a photo of a cat | 0.0000 |
| a photo of a dog | 0.0000 |
| a solid green image | 0.0176 |
| a landscape | 0.0000 |

| Metric | Value |
|---|---|
| Result | PASS (correctly identified natural language description) |
| Vision Forward | 39.2s |
| Note | Highest score by far, demonstrating text understanding |

Test Summary

| Test | Input | Best Label | Correct? | Score |
|---|---|---|---|---|
| Color (red) | Solid red | "red" | PASS | 0.0022 |
| Color (blue) | Solid blue | "blue" | PASS | 0.0014 |
| NL Description | Solid green | "a solid green image" | PASS | 0.0176 |
| Overall | | | 3/3 (100%) | |
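The "Best Label" column is simply the label with the highest score. A minimal sketch of that selection step, using a hypothetical helper that is not part of the documented CLI:

```rust
/// Return the label with the highest score, or None if the input is empty.
fn best_label<'a>(labels: &[&'a str], scores: &[f32]) -> Option<(&'a str, f32)> {
    labels
        .iter()
        .zip(scores)
        // total_cmp gives a total order on f32, so NaN cannot cause a panic.
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(label, score)| (*label, *score))
}
```

Applied to the Test 1 scores (`red` 0.0022, all others 0.0000), this selects `red`, even though the absolute score is small.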

Performance

| Metric | Value |
|---|---|
| Binary Load | ~115ms (full model, 210 MB) |
| Safetensors Load | ~11-20s (from safetensors) |
| Vision Forward | ~13-20s (196 tokens, 12 layers) |
| Text Forward | ~5s per label |
| Total (4 labels) | ~33-55s |
| Memory (Vision Q4) | 58 MB |
| Memory (Text Q4) | 151 MB |
| Binary Save | ~2s (210 MB) |

QORA Model Family

| Engine | Model | Params | Size (Q4) | Purpose |
|---|---|---|---|---|
| QORA | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat |
| QORA-TTS | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis |
| QORA-Vision (Image) | SigLIP 2 Base | 93M | 58 MB | Image embeddings, zero-shot classification |
| QORA-Vision (Video) | ViViT Base | 89M | 60 MB | Video action classification |

Built with QORA - Pure Rust AI Inference
