| | --- |
| | language: |
| | - en |
| | license: apache-2.0 |
| | tags: |
| | - rust |
| | - cpu-inference |
| | - quantized |
| | - q4 |
| | - video-classification |
| | - action-recognition |
| | - vivit |
| | - video-transformer |
| | - pure-rust |
| | - no-python |
| | - no-cuda |
| | - kinetics-400 |
| | base_model: google/vivit-b-16x2-kinetics400 |
| | library_name: qora |
| | pipeline_tag: video-classification |
| | model-index: |
| | - name: QORA-Vision-Video |
| | results: |
| | - task: |
| | type: video-classification |
| | dataset: |
| | name: Kinetics-400 |
| | type: kinetics-400 |
| | metrics: |
| | - name: Top-1 Accuracy |
| | type: accuracy |
| | value: 79.3 |
| | --- |
| | |
| | # QORA-Vision (Video) - Native Rust Video Classifier |
| |
|
| | Pure Rust video action classification engine based on ViViT. Classifies video clips into 400 action categories from Kinetics-400. No Python runtime, no CUDA, no external dependencies. |
| |
|
| | ## Overview |
| |
|
| | | Property | Value | |
| | |----------|-------| |
| | | **Engine** | QORA-Vision (Pure Rust) | |
| | | **Base Model** | ViViT-B/16x2 (google/vivit-b-16x2-kinetics400) | |
| | | **Parameters** | ~89M | |
| | | **Quantization** | Q4 (4-bit symmetric, group_size=32) | |
| | | **Model Size** | 60 MB (Q4 binary) | |
| | | **Executable** | 4.4 MB | |
| | | **Input** | 32 frames x 224x224 RGB video | |
| | | **Output** | 768-dim embeddings + 400-class logits | |
| | | **Classes** | 400 (Kinetics-400 action categories) | |
| | | **Platform** | Windows x86_64 (CPU-only) | |
| |
|
| | ## Architecture |
| |
|
| | ### ViViT-B/16x2 (Video Vision Transformer) |
| |
|
| | | Component | Details | |
| | |-----------|---------| |
| | | **Backbone** | 12-layer ViT-Base transformer | |
| | | **Hidden Size** | 768 | |
| | | **Attention Heads** | 12 (head_dim=64) | |
| | | **MLP (Intermediate)** | 3,072 (GELU-Tanh activation) | |
| | | **Tubelet Size** | [2, 16, 16] (temporal, height, width) | |
| | | **Input Frames** | 32 | |
| | | **Patches per Frame** | 14 x 14 = 196 | |
| | | **Total Tubelets** | 16 x 14 x 14 = 3,136 | |
| | | **Sequence Length** | 3,137 (3,136 tubelets + 1 CLS token) | |
| | | **Normalization** | LayerNorm with bias (eps=1e-6) | |
| | | **Attention** | Bidirectional (no causal mask) | |
| | | **Position Encoding** | Learned [3137, 768] | |
| | | **Classifier** | Linear(768, 400) | |
| | |
| | ### Key Design: Tubelet Embedding |
| | |
| | Unlike image ViTs that use 2D patches, ViViT uses **3D tubelets** — spatiotemporal volumes that capture both spatial and temporal information: |
| | |
| | ``` |
| | Video [3, 32, 224, 224] (C, T, H, W) |
| | → Extract tubelets [3, 2, 16, 16] = 1,536 values each |
| | → 16 temporal × 14 height × 14 width = 3,136 tubelets |
| | → GEMM: [3136, 1536] × [1536, 768] → [3136, 768] |
| | → Prepend CLS token → [3137, 768] |
| | ``` |
| | |
| | ## Pipeline |
| | |
| | ``` |
| | Video (32 frames × 224×224) |
| | → Tubelet Embedding (3D Conv: [2,16,16]) |
| | → 3,136 tubelets + CLS token = 3,137 sequence |
| | → Add Position Embeddings [3137, 768] |
| | → 12x ViT Transformer Layers (bidirectional) |
| | → Final LayerNorm |
| | → CLS token → Linear(768, 400) |
| | → Kinetics-400 logits |
| | ``` |
| | |
| | ## Files |
| | |
| | ``` |
| | vivit-model/ |
| | qora-vision.exe - 4.4 MB Inference engine |
| | model.qora-vision - 60 MB Video model (Q4) |
| | config.json - 293 B QORA-branded config |
| | README.md - This file |
| | ``` |
| | |
| | ## Usage |
| | |
| | ```bash |
| | # Classify from frame directory (fast, from binary) |
| | qora-vision.exe vivit --load model.qora-vision --frames ./my_frames/ |
| |
|
| | # Classify from video file (requires ffmpeg) |
| | qora-vision.exe vivit --load model.qora-vision --video clip.mp4 |
| |
|
| | # Load from safetensors (slow, first time) |
| | qora-vision.exe vivit --frames ./my_frames/ --model-path ../ViViT/ |
| | |
| | # Save binary for fast loading |
| | qora-vision.exe vivit --model-path ../ViViT/ --save model.qora-vision |
| | ``` |
| | |
| | ### CLI Arguments |
| | |
| | | Flag | Default | Description | |
| | |------|---------|-------------| |
| | | `--model-path <path>` | `.` | Path to model directory (safetensors) | |
| | | `--frames <dir>` | - | Directory of 32 JPEG/PNG frames | |
| | | `--video <file>` | - | Video file (extracts frames via ffmpeg) | |
| | | `--load <path>` | - | Load binary (.qora-vision) | |
| | | `--save <path>` | - | Save binary | |
| | | `--f16` | off | Use F16 weights instead of Q4 | |
| | |
| | ### Input Requirements |
| | |
| | - **32 frames** at 224x224 resolution |
| | - Frames are uniformly sampled from the video |
| | - Each frame: resize shortest edge to 224, center crop |
| | - Normalize: `(pixel/255 - 0.5) / 0.5` = range [-1, 1] |
| | |
| | ## Published Benchmarks |
| | |
| | ### ViViT (Original Paper - ICCV 2021) |
| | |
| | | Model Variant | Kinetics-400 Top-1 | Top-5 | Views | |
| | |---------------|-------------------|-------|-------| |
| | | **ViViT-B/16x2 (Factorised)** | **79.3%** | **93.4%** | 1x3 | |
| | | ViViT-L/16x2 (Factorised) | 81.7% | 93.8% | 1x3 | |
| | | ViViT-H/14x2 (JFT pretrained) | 84.9% | 95.8% | 4x3 | |
| | |
| | ### Comparison with Other Video Models |
| | |
| | | Model | Params | Kinetics-400 Top-1 | Architecture | |
| | |-------|--------|-------------------|--------------| |
| | | **QORA-Vision (ViViT-B/16x2)** | 89M | 79.3% | Video ViT (tubelets) | |
| | | TimeSformer-B | 121M | 78.0% | Divided attention | |
| | | Video Swin-T | 28M | 78.8% | 3D shifted windows | |
| | | SlowFast R101-8x8 | 53M | 77.6% | Two-stream CNN | |
| | | X3D-XXL | 20M | 80.4% | Efficient 3D CNN | |
| | |
| | ## Test Results |
| | |
| | ### Test: 32 Synthetic Frames (Color Gradient) |
| | |
| | **Input:** 32 test frames (224x224, color gradient from red to blue) |
| | |
| | **Output:** |
| | ``` |
| | Top-5 predictions: |
| | #1: class 169 (score: 4.5807) |
| | #2: class 346 (score: 4.2157) |
| | #3: class 84 (score: 3.3206) |
| | #4: class 107 (score: 3.2053) |
| | #5: class 245 (score: 2.5995) |
| | ``` |
| | |
| | | Metric | Value | |
| | |--------|-------| |
| | | Tubelets | 3,136 patches | |
| | | Sequence Length | 3,137 (+ CLS) | |
| | | Embedding | dim=768, L2 norm=17.0658 | |
| | | Forward Pass | ~726s (12 layers x 12 heads, 3137x3137 attention) | |
| | | Binary Load | 30ms (from .qora-vision) | |
| | | Safetensors Load | ~5s (from safetensors) | |
| | | Model Memory | 60 MB (Q4) | |
| | | Binary Save | ~70ms | |
| | | Result | PASS (valid predictions with correct logit distribution) | |
| | |
| | ### Performance Notes |
| | |
| | The forward pass time (~726s) is due to the large sequence length (3,137 tokens). Each attention layer computes a 3,137 x 3,137 attention matrix across 12 heads. This is expected for CPU-only inference of a video model — GPU acceleration would dramatically improve this. |
| | |
| | | Component | Time | |
| | |-----------|------| |
| | | Tubelet Embedding | ~0.1s | |
| | | Attention (per layer) | ~60s (3137x3137 matrix) | |
| | | 12 Layers Total | ~726s | |
| | | Final Classifier | <1s | |
| | |
| | ## Kinetics-400 Classes |
| | |
| | The model classifies videos into 400 human action categories including: |
| | |
| | *Sports:* basketball, golf, swimming, skateboarding, skiing, surfing, tennis, volleyball... |
| | *Daily activities:* cooking, eating, drinking, brushing teeth, washing dishes... |
| | *Music:* playing guitar, piano, drums, violin, saxophone... |
| | *Dance:* ballet, breakdancing, salsa, tap dancing... |
| | *Other:* driving car, riding horse, flying kite, blowing candles... |
| | |
| | Full class list: [Kinetics-400 Labels](https://github.com/deepmind/kinetics-i3d/blob/master/data/label_map.txt) |
| | |
| | ## QORA Model Family |
| | |
| | | Engine | Model | Params | Size (Q4) | Purpose | |
| | |--------|-------|--------|-----------|---------| |
| | | **QORA** | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat | |
| | | **QORA-TTS** | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis | |
| | | **QORA-Vision (Image)** | SigLIP 2 Base | 93M | 58 MB | Image embeddings, zero-shot classification | |
| | | **QORA-Vision (Video)** | ViViT Base | 89M | 60 MB | Video action classification | |
| | |
| | --- |
| | |
| | *Built with QORA - Pure Rust AI Inference* |
| | |