---
language:
- en
- multilingual
license: apache-2.0
tags:
- rust
- cpu-inference
- quantized
- q4
- image-classification
- zero-shot-classification
- image-embedding
- siglip
- vision-transformer
- pure-rust
- no-python
- no-cuda
- contrastive-learning
base_model: google/siglip2-base-patch16-224
library_name: qora
pipeline_tag: zero-shot-image-classification
model-index:
- name: QORA-Vision-Image
results:
- task:
type: zero-shot-image-classification
dataset:
name: ImageNet-1K
type: imagenet-1k
metrics:
- name: Zero-shot Accuracy
type: accuracy
value: 69.8
---
# QORA-Vision (Image) - Native Rust Image Encoder
A pure-Rust image understanding engine based on SigLIP 2: zero-shot image classification, image embeddings, and image-text similarity. No Python runtime, no CUDA, no external dependencies.
## Overview
| Property | Value |
|----------|-------|
| **Engine** | QORA-Vision (Pure Rust) |
| **Base Model** | SigLIP 2 Base (google/siglip2-base-patch16-224) |
| **Vision Params** | ~93M |
| **Text Params** | ~283M (256K vocab) |
| **Quantization** | Q4 (4-bit symmetric, group_size=32) |
| **Model Size** | 210 MB (Q4 binary, vision + text) |
| **Executable** | 4.4 MB |
| **Input** | 224x224 RGB images (PNG/JPEG) |
| **Output** | 768-dim embeddings + zero-shot classification scores |
| **Platform** | Windows x86_64 (CPU-only) |
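The Q4 row above (4-bit symmetric, group_size=32) stores one scale per group of 32 weights and maps each weight to a small signed integer. Below is a minimal round-trip sketch assuming the common symmetric convention of quantizing into [-7, 7]; the helper names are hypothetical and the actual on-disk layout of `model.qora-vision` is not documented here:

```rust
// Sketch of symmetric 4-bit group quantization (group_size = 32), assuming the
// common convention of an integer range [-7, 7] plus one f32 scale per group.
// Hypothetical helpers; not the engine's actual code.

fn quantize_q4(group: &[f32; 32]) -> ([i8; 32], f32) {
    // Symmetric: the scale is chosen so the largest magnitude maps to +/-7.
    let max_abs = group.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let mut q = [0i8; 32];
    for (i, &w) in group.iter().enumerate() {
        q[i] = (w / scale).round().clamp(-7.0, 7.0) as i8;
    }
    (q, scale)
}

fn dequantize_q4(q: &[i8; 32], scale: f32) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, &v) in q.iter().enumerate() {
        out[i] = v as f32 * scale; // w is approximately q * scale
    }
    out
}

fn main() {
    let mut group = [0.0f32; 32];
    for i in 0..32 {
        group[i] = (i as f32 - 16.0) * 0.01;
    }
    let (q, scale) = quantize_q4(&group);
    let deq = dequantize_q4(&q, scale);
    // Round-trip error is bounded by half a quantization step (scale / 2).
    let max_err = group
        .iter()
        .zip(deq.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    assert!(max_err <= scale * 0.5 + 1e-6);
    println!("scale = {}, max round-trip error = {}", scale, max_err);
}
```

Two 4-bit values pack into one byte on disk, so as a back-of-envelope estimate, ~376M parameters (vision + text) at 4 bits plus compactly stored per-group scales lands in the vicinity of the 210 MB listed above.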
## Architecture
### Vision Encoder (12-layer ViT-Base)
| Component | Details |
|-----------|---------|
| **Layers** | 12 transformer layers |
| **Hidden Size** | 768 |
| **Attention Heads** | 12 (head_dim=64) |
| **MLP (Intermediate)** | 3,072 (GELU-Tanh activation) |
| **Patch Size** | 16x16 (non-overlapping) |
| **Sequence Length** | 196 patches (14x14 grid) |
| **Normalization** | LayerNorm with bias (eps=1e-6) |
| **Attention** | Bidirectional (no causal mask) |
| **Position Encoding** | Learned position embeddings |
| **Pooling** | MAP (Multi-head Attention Pooling) |
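The MLP row above lists GELU-Tanh activation. For reference, this is the standard tanh approximation of GELU (the textbook formula, not code extracted from the engine):

```rust
/// Tanh-approximated GELU:
/// gelu(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn gelu_tanh(x: f32) -> f32 {
    const SQRT_2_OVER_PI: f32 = 0.797_884_56;
    0.5 * x * (1.0 + (SQRT_2_OVER_PI * (x + 0.044715 * x * x * x)).tanh())
}

fn main() {
    // Behaves like the identity for large positive x, and like zero for large negative x.
    assert_eq!(gelu_tanh(0.0), 0.0);
    assert!((gelu_tanh(10.0) - 10.0).abs() < 1e-3);
    assert!(gelu_tanh(-10.0).abs() < 1e-3);
    println!("gelu_tanh(1.0) = {}", gelu_tanh(1.0));
}
```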
### Text Encoder (12-layer Transformer)
| Component | Details |
|-----------|---------|
| **Layers** | 12 transformer layers |
| **Hidden Size** | 768 |
| **Vocabulary** | 256,000 tokens |
| **Max Position** | 64 tokens |
| **Pooling** | Last token + linear head |
### Contrastive Scoring
```
score = sigmoid(cosine_sim(image_embed, text_embed) * exp(logit_scale) + logit_bias)
```
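A minimal Rust sketch of this scoring rule, using toy 4-dim embeddings and placeholder values for `logit_scale` and `logit_bias` (the learned values live inside the model binary):

```rust
// Sketch of SigLIP's sigmoid-based image-text scoring. Toy inputs only.

fn cosine_sim(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn siglip_score(img: &[f32], txt: &[f32], logit_scale: f32, logit_bias: f32) -> f32 {
    let logit = cosine_sim(img, txt) * logit_scale.exp() + logit_bias;
    1.0 / (1.0 + (-logit).exp()) // sigmoid
}

fn main() {
    // Real QORA-Vision embeddings are 768-dim and already L2-normalized.
    let img = [1.0, 0.0, 0.0, 0.0];
    let txt = [0.9, 0.1, 0.0, 0.0];
    let score = siglip_score(&img, &txt, 0.0, -1.0);
    assert!(score > 0.0 && score < 1.0);
    println!("score = {:.4}", score);
}
```

Each image-text pair is scored independently through the sigmoid, so scores across labels do not sum to 1.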
## Pipeline
```
Image (224x224) → Patch Embedding (196 patches)
→ Add Position Embeddings
→ 12x ViT Transformer Layers (bidirectional)
→ Post-LayerNorm
→ MAP Pooling (cross-attention with learned probe)
→ L2 Normalize
→ 768-dim Image Embedding
Text → Tokenize → Token + Position Embedding
→ 12x Transformer Layers
→ Final LayerNorm (last token)
→ Linear Head
→ L2 Normalize
→ 768-dim Text Embedding
Score = sigmoid(cosine_sim * exp(scale) + bias)
```
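The patch-embedding step at the top of this pipeline can be sketched as pure index arithmetic. `extract_patches` is a hypothetical helper; the real engine also applies a learned linear projection to each flattened patch:

```rust
// Split a 224x224 RGB image into 196 non-overlapping 16x16 patches (14x14 grid).

fn extract_patches(pixels: &[f32], img_size: usize, patch: usize) -> Vec<Vec<f32>> {
    let grid = img_size / patch; // 224 / 16 = 14
    let mut patches = Vec::with_capacity(grid * grid);
    for py in 0..grid {
        for px in 0..grid {
            let mut flat = Vec::with_capacity(patch * patch * 3);
            for y in 0..patch {
                for x in 0..patch {
                    let row = py * patch + y;
                    let col = px * patch + x;
                    let base = (row * img_size + col) * 3; // interleaved RGB
                    flat.extend_from_slice(&pixels[base..base + 3]);
                }
            }
            patches.push(flat);
        }
    }
    patches
}

fn main() {
    let pixels = vec![0.5f32; 224 * 224 * 3];
    let patches = extract_patches(&pixels, 224, 16);
    assert_eq!(patches.len(), 196); // 14 x 14 grid
    assert_eq!(patches[0].len(), 768); // 16 * 16 * 3 values per patch
    println!("{} patches of {} values each", patches.len(), patches[0].len());
}
```

Conveniently, 16 * 16 * 3 = 768, so for ViT-B/16 the per-patch projection maps a 768-value flattened patch to the 768-dim hidden size.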
## Files
```
siglip-model/
qora-vision.exe - 4.4 MB Inference engine
model.qora-vision - 210 MB Full model (vision + text, Q4)
tokenizer.json - 33 MB Text tokenizer (256K vocab)
config.json - 611 B QORA-branded config
README.md - This file
```
## Usage
```bash
# Zero-shot classification (fast, from binary)
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --labels "cat,dog,bird,car"
# Image-text similarity
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --text "a photo of a sunset"
# Image embedding only
qora-vision.exe siglip --load model.qora-vision --image photo.jpg
# Load from safetensors (slow, first time)
qora-vision.exe siglip --model-path ../SigLIP2/ --image photo.jpg --labels "cat,dog,bird,car"
# Save binary for fast loading
qora-vision.exe siglip --model-path ../SigLIP2/ --save model.qora-vision
```
### CLI Arguments
| Flag | Default | Description |
|------|---------|-------------|
| `--model-path <path>` | `.` | Path to model directory (safetensors) |
| `--image <path>` | - | Input image (PNG/JPEG) |
| `--labels <list>` | - | Comma-separated labels for zero-shot |
| `--text <string>` | - | Text for similarity scoring |
| `--load <path>` | - | Load binary (.qora-vision, includes vision + text) |
| `--save <path>` | - | Save full model binary (vision + text + scale/bias) |
| `--f16` | off | Use F16 weights instead of Q4 |
## Published Benchmarks
### SigLIP 2 Base (224px) - Published Scores
| Benchmark | Score |
|-----------|-------|
| **ImageNet-1K Zero-shot** | ~69.8% |
| **Multilingual support** | Yes (trained on WebLI) |
SigLIP 2 improves over the original SigLIP with stronger semantic understanding, localization, and dense features. Its sigmoid loss also yields better-calibrated scores than CLIP's softmax-based contrastive objective.
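The calibration point can be demonstrated directly: softmax forces label scores to compete and sum to 1, while a per-pair sigmoid lets every score stay low when nothing matches well. An illustrative sketch (arbitrary logits, not engine code):

```rust
// Softmax vs. per-label sigmoid on the same logits, chosen so no label matches well.

fn softmax(logits: &[f32]) -> Vec<f32> {
    let m = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - m).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn sigmoid_each(logits: &[f32]) -> Vec<f32> {
    logits.iter().map(|&l| 1.0 / (1.0 + (-l).exp())).collect()
}

fn main() {
    // Three labels, all poor matches.
    let logits = [-2.0f32, -2.1, -2.2];
    let sm = softmax(&logits);
    let sg = sigmoid_each(&logits);
    // Softmax must distribute probability 1.0 over the labels, so the "best"
    // label looks confident even though nothing matches.
    assert!((sm.iter().sum::<f32>() - 1.0).abs() < 1e-5);
    assert!(sm[0] > 0.3);
    // Sigmoid scores each pair independently, so all scores stay low.
    assert!(sg.iter().all(|&s| s < 0.15));
    println!("softmax: {:?}", sm);
    println!("sigmoid: {:?}", sg);
}
```

This is also why the test scores later in this card are small in absolute terms (e.g. 0.0022) while still cleanly separating the correct label.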
### Model Comparison
| Model | Params | Image Size | Architecture | Zero-shot ImageNet |
|-------|--------|------------|-------------|-------------------|
| **QORA-Vision (SigLIP 2 Base)** | 93M | 224 | ViT-B/16 | ~69.8% |
| CLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 68.3% |
| SigLIP Base (v1) | 86M | 224 | ViT-B/16 | 66.2% |
| OpenCLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 67.0% |
## Test Results
All tests run with Q4 quantization on CPU.
### Test 1: Red Image Classification
**Input:** Solid red 224x224 image

**Labels:** red, blue, green, yellow

| Label | Score |
|-------|-------|
| **red** | **0.0022** |
| blue | 0.0000 |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified "red") |
| Vision Forward | 42.0s |
| Embedding Dim | 768, L2 norm = 1.0000 |
### Test 2: Blue Image Classification
**Input:** Solid blue 224x224 image

**Labels:** red, blue, green, yellow

| Label | Score |
|-------|-------|
| red | 0.0000 |
| **blue** | **0.0014** |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified "blue") |
| Vision Forward | 31.5s |
### Test 3: Green Image with Natural Language Labels
**Input:** Solid green 224x224 image

**Labels:** "a photo of a cat", "a photo of a dog", "a solid green image", "a landscape"

| Label | Score |
|-------|-------|
| a photo of a cat | 0.0000 |
| a photo of a dog | 0.0000 |
| **a solid green image** | **0.0176** |
| a landscape | 0.0000 |

| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified natural language description) |
| Vision Forward | 39.2s |
| Note | Highest score by far, demonstrating text understanding |
### Test Summary
| Test | Input | Best Label | Correct? | Score |
|------|-------|------------|----------|-------|
| Color (red) | Solid red | "red" | PASS | 0.0022 |
| Color (blue) | Solid blue | "blue" | PASS | 0.0014 |
| NL Description | Solid green | "a solid green image" | PASS | 0.0176 |
| **Overall** | | | **3/3 (100%)** | |
## Performance
| Metric | Value |
|--------|-------|
| **Binary Load** | ~115ms (full model, 210 MB) |
| **Safetensors Load** | ~11-20s |
| **Vision Forward** | ~13-20s (196 tokens, 12 layers) |
| **Text Forward** | ~5s per label |
| **Total (4 labels)** | ~33-55s |
| **Memory (Vision Q4)** | 58 MB |
| **Memory (Text Q4)** | 151 MB |
| **Binary Save** | ~2s (210 MB) |
## QORA Model Family
| Engine | Model | Params | Size (Q4) | Purpose |
|--------|-------|--------|-----------|---------|
| **QORA** | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat |
| **QORA-TTS** | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis |
| **QORA-Vision (Image)** | SigLIP 2 Base | 93M | 58 MB | Image embeddings, zero-shot classification |
| **QORA-Vision (Video)** | ViViT Base | 89M | 60 MB | Video action classification |
---
*Built with QORA - Pure Rust AI Inference*