---
language:
- en
- multilingual
license: apache-2.0
tags:
- rust
- cpu-inference
- quantized
- q4
- image-classification
- zero-shot-classification
- image-embedding
- siglip
- vision-transformer
- pure-rust
- no-python
- no-cuda
- contrastive-learning
base_model: google/siglip2-base-patch16-224
library_name: qora
pipeline_tag: zero-shot-image-classification
model-index:
- name: QORA-Vision-Image
results:
- task:
type: zero-shot-image-classification
dataset:
name: ImageNet-1K
type: imagenet-1k
metrics:
- name: Zero-shot Accuracy
type: accuracy
value: 69.8
---
# QORA-Vision (Image) - Native Rust Image Encoder
A pure-Rust image understanding engine based on SigLIP 2, providing zero-shot image classification, image embeddings, and image-text similarity scoring. No Python runtime, no CUDA, no external dependencies.
## Overview
| Property | Value |
|----------|-------|
| **Engine** | QORA-Vision (Pure Rust) |
| **Base Model** | SigLIP 2 Base (google/siglip2-base-patch16-224) |
| **Vision Params** | ~93M |
| **Text Params** | ~283M (256K vocab) |
| **Quantization** | Q4 (4-bit symmetric, group_size=32) |
| **Model Size** | 210 MB (Q4 binary, vision + text) |
| **Executable** | 4.4 MB |
| **Input** | 224x224 RGB images (PNG/JPEG) |
| **Output** | 768-dim embeddings + zero-shot classification scores |
| **Platform** | Windows x86_64 (CPU-only) |
## Architecture
### Vision Encoder (12-layer ViT-Base)
| Component | Details |
|-----------|---------|
| **Layers** | 12 transformer layers |
| **Hidden Size** | 768 |
| **Attention Heads** | 12 (head_dim=64) |
| **MLP (Intermediate)** | 3,072 (GELU-Tanh activation) |
| **Patch Size** | 16x16 (non-overlapping) |
| **Sequence Length** | 196 patches (14x14 grid) |
| **Normalization** | LayerNorm with bias (eps=1e-6) |
| **Attention** | Bidirectional (no causal mask) |
| **Position Encoding** | Learned position embeddings |
| **Pooling** | MAP (Multi-head Attention Pooling) |
### Text Encoder (12-layer Transformer)
| Component | Details |
|-----------|---------|
| **Layers** | 12 transformer layers |
| **Hidden Size** | 768 |
| **Vocabulary** | 256,000 tokens |
| **Max Position** | 64 tokens |
| **Pooling** | Last token + linear head |
### Contrastive Scoring
```
score = sigmoid(cosine_sim(image_embed, text_embed) * exp(logit_scale) + logit_bias)
```
## Pipeline
```
Image (224x224) → Patch Embedding (196 patches)
  → Add Position Embeddings
  → 12x ViT Transformer Layers (bidirectional)
  → Post-LayerNorm
  → MAP Pooling (cross-attention with learned probe)
  → L2 Normalize
  → 768-dim Image Embedding

Text → Tokenize → Token + Position Embedding
  → 12x Transformer Layers
  → Final LayerNorm (last token)
  → Linear Head
  → L2 Normalize
  → 768-dim Text Embedding

Score = sigmoid(cosine_sim * exp(scale) + bias)
```
## Files
```
siglip-model/
qora-vision.exe - 4.4 MB Inference engine
model.qora-vision - 210 MB Full model (vision + text, Q4)
tokenizer.json - 33 MB Text tokenizer (256K vocab)
config.json - 611 B QORA-branded config
README.md - This file
```
## Usage
```bash
# Zero-shot classification (fast, from binary)
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --labels "cat,dog,bird,car"
# Image-text similarity
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --text "a photo of a sunset"
# Image embedding only
qora-vision.exe siglip --load model.qora-vision --image photo.jpg
# Load from safetensors (slow, first time)
qora-vision.exe siglip --model-path ../SigLIP2/ --image photo.jpg --labels "cat,dog,bird,car"
# Save binary for fast loading
qora-vision.exe siglip --model-path ../SigLIP2/ --save model.qora-vision
```
### CLI Arguments
| Flag | Default | Description |
|------|---------|-------------|
| `--model-path <path>` | `.` | Path to model directory (safetensors) |
| `--image <path>` | - | Input image (PNG/JPEG) |
| `--labels <list>` | - | Comma-separated labels for zero-shot |
| `--text <string>` | - | Text for similarity scoring |
| `--load <path>` | - | Load binary (.qora-vision, includes vision + text) |
| `--save <path>` | - | Save full model binary (vision + text + scale/bias) |
| `--f16` | off | Use F16 weights instead of Q4 |
## Published Benchmarks
### SigLIP 2 Base (224px) - Published Scores
| Benchmark | Score |
|-----------|-------|
| **ImageNet-1K Zero-shot** | ~69.8% |
| **Multilingual support** | Yes (trained on WebLI) |
SigLIP 2 improves over the original SigLIP with enhanced semantic understanding, localization, and dense features. The sigmoid loss yields better-calibrated scores than CLIP's softmax-based approach, since each image-text pair is scored independently.
### Model Comparison
| Model | Params | Image Size | Architecture | Zero-shot ImageNet |
|-------|--------|------------|-------------|-------------------|
| **QORA-Vision (SigLIP 2 Base)** | 93M | 224 | ViT-B/16 | ~69.8% |
| CLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 68.3% |
| SigLIP Base (v1) | 86M | 224 | ViT-B/16 | 66.2% |
| OpenCLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 67.0% |
## Test Results
All tests run with Q4 quantization on CPU.
### Test 1: Red Image Classification
**Input:** Solid red 224x224 image
**Labels:** red, blue, green, yellow
| Label | Score |
|-------|-------|
| **red** | **0.0022** |
| blue | 0.0000 |
| green | 0.0000 |
| yellow | 0.0000 |
| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified "red") |
| Vision Forward | 42.0s |
| Embedding Dim | 768, L2 norm = 1.0000 |
### Test 2: Blue Image Classification
**Input:** Solid blue 224x224 image
**Labels:** red, blue, green, yellow
| Label | Score |
|-------|-------|
| red | 0.0000 |
| **blue** | **0.0014** |
| green | 0.0000 |
| yellow | 0.0000 |
| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified "blue") |
| Vision Forward | 31.5s |
### Test 3: Green Image with Natural Language Labels
**Input:** Solid green 224x224 image
**Labels:** "a photo of a cat", "a photo of a dog", "a solid green image", "a landscape"
| Label | Score |
|-------|-------|
| a photo of a cat | 0.0000 |
| a photo of a dog | 0.0000 |
| **a solid green image** | **0.0176** |
| a landscape | 0.0000 |
| Metric | Value |
|--------|-------|
| Result | PASS (correctly identified natural language description) |
| Vision Forward | 39.2s |
| Note | Highest score by far, demonstrating text understanding |
### Test Summary
| Test | Input | Best Label | Correct? | Score |
|------|-------|------------|----------|-------|
| Color (red) | Solid red | "red" | PASS | 0.0022 |
| Color (blue) | Solid blue | "blue" | PASS | 0.0014 |
| NL Description | Solid green | "a solid green image" | PASS | 0.0176 |
| **Overall** | | | **3/3 (100%)** | |
## Performance
| Metric | Value |
|--------|-------|
| **Binary Load** | ~115ms (full model, 210 MB) |
| **Safetensors Load** | ~11-20s |
| **Vision Forward** | ~13-20s (196 tokens, 12 layers) |
| **Text Forward** | ~5s per label |
| **Total (4 labels)** | ~33-55s |
| **Memory (Vision Q4)** | 58 MB |
| **Memory (Text Q4)** | 151 MB |
| **Binary Save** | ~2s (210 MB) |
## QORA Model Family
| Engine | Model | Params | Size (Q4) | Purpose |
|--------|-------|--------|-----------|---------|
| **QORA** | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat |
| **QORA-TTS** | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis |
| **QORA-Vision (Image)** | SigLIP 2 Base | 93M | 58 MB | Image embeddings, zero-shot classification |
| **QORA-Vision (Video)** | ViViT Base | 89M | 60 MB | Video action classification |
---
*Built with QORA - Pure Rust AI Inference*