--- language: - en license: apache-2.0 tags: - rust - cpu-inference - quantized - q4 - video-classification - action-recognition - vivit - video-transformer - pure-rust - no-python - no-cuda - kinetics-400 base_model: google/vivit-b-16x2-kinetics400 library_name: qora pipeline_tag: video-classification model-index: - name: QORA-Vision-Video results: - task: type: video-classification dataset: name: Kinetics-400 type: kinetics-400 metrics: - name: Top-1 Accuracy type: accuracy value: 79.3 --- # QORA-Vision (Video) - Native Rust Video Classifier Pure Rust video action classification engine based on ViViT. Classifies video clips into 400 action categories from Kinetics-400. No Python runtime, no CUDA, no external dependencies. ## Overview | Property | Value | |----------|-------| | **Engine** | QORA-Vision (Pure Rust) | | **Base Model** | ViViT-B/16x2 (google/vivit-b-16x2-kinetics400) | | **Parameters** | ~89M | | **Quantization** | Q4 (4-bit symmetric, group_size=32) | | **Model Size** | 60 MB (Q4 binary) | | **Executable** | 4.4 MB | | **Input** | 32 frames x 224x224 RGB video | | **Output** | 768-dim embeddings + 400-class logits | | **Classes** | 400 (Kinetics-400 action categories) | | **Platform** | Windows x86_64 (CPU-only) | ## Architecture ### ViViT-B/16x2 (Video Vision Transformer) | Component | Details | |-----------|---------| | **Backbone** | 12-layer ViT-Base transformer | | **Hidden Size** | 768 | | **Attention Heads** | 12 (head_dim=64) | | **MLP (Intermediate)** | 3,072 (GELU-Tanh activation) | | **Tubelet Size** | [2, 16, 16] (temporal, height, width) | | **Input Frames** | 32 | | **Patches per Frame** | 14 x 14 = 196 | | **Total Tubelets** | 16 x 14 x 14 = 3,136 | | **Sequence Length** | 3,137 (3,136 tubelets + 1 CLS token) | | **Normalization** | LayerNorm with bias (eps=1e-6) | | **Attention** | Bidirectional (no causal mask) | | **Position Encoding** | Learned [3137, 768] | | **Classifier** | Linear(768, 400) | ### Key Design: Tubelet Embedding Unlike image ViTs that use 2D patches, ViViT uses **3D tubelets** — spatiotemporal volumes that capture both spatial and temporal information: ``` Video [3, 32, 224, 224] (C, T, H, W) → Extract tubelets [3, 2, 16, 16] = 1,536 values each → 16 temporal × 14 height × 14 width = 3,136 tubelets → GEMM: [3136, 1536] × [1536, 768] → [3136, 768] → Prepend CLS token → [3137, 768] ``` ## Pipeline ``` Video (32 frames × 224×224) → Tubelet Embedding (3D Conv: [2,16,16]) → 3,136 tubelets + CLS token = 3,137 sequence → Add Position Embeddings [3137, 768] → 12x ViT Transformer Layers (bidirectional) → Final LayerNorm → CLS token → Linear(768, 400) → Kinetics-400 logits ``` ## Files ``` vivit-model/ qora-vision.exe - 4.4 MB Inference engine model.qora-vision - 60 MB Video model (Q4) config.json - 293 B QORA-branded config README.md - This file ``` ## Usage ```bash # Classify from frame directory (fast, from binary) qora-vision.exe vivit --load model.qora-vision --frames ./my_frames/ # Classify from video file (requires ffmpeg) qora-vision.exe vivit --load model.qora-vision --video clip.mp4 # Load from safetensors (slow, first time) qora-vision.exe vivit --frames ./my_frames/ --model-path ../ViViT/ # Save binary for fast loading qora-vision.exe vivit --model-path ../ViViT/ --save model.qora-vision ``` ### CLI Arguments | Flag | Default | Description | |------|---------|-------------| | `--model-path ` | `.` | Path to model directory (safetensors) | | `--frames ` | - | Directory of 32 JPEG/PNG frames | | `--video ` | - | Video file (extracts frames via ffmpeg) | | `--load ` | - | Load binary (.qora-vision) | | `--save ` | - | Save binary | | `--f16` | off | Use F16 weights instead of Q4 | ### Input Requirements - **32 frames** at 224x224 resolution - Frames are uniformly sampled from the video - Each frame: resize shortest edge to 224, center crop - Normalize: `(pixel/255 - 0.5) / 0.5` = range [-1, 1] ## Published Benchmarks ### ViViT (Original Paper - ICCV 2021) | Model Variant | Kinetics-400 Top-1 | Top-5 | Views | |---------------|-------------------|-------|-------| | **ViViT-B/16x2 (Factorised)** | **79.3%** | **93.4%** | 1x3 | | ViViT-L/16x2 (Factorised) | 81.7% | 93.8% | 1x3 | | ViViT-H/14x2 (JFT pretrained) | 84.9% | 95.8% | 4x3 | ### Comparison with Other Video Models | Model | Params | Kinetics-400 Top-1 | Architecture | |-------|--------|-------------------|--------------| | **QORA-Vision (ViViT-B/16x2)** | 89M | 79.3% | Video ViT (tubelets) | | TimeSformer-B | 121M | 78.0% | Divided attention | | Video Swin-T | 28M | 78.8% | 3D shifted windows | | SlowFast R101-8x8 | 53M | 77.6% | Two-stream CNN | | X3D-XXL | 20M | 80.4% | Efficient 3D CNN | ## Test Results ### Test: 32 Synthetic Frames (Color Gradient) **Input:** 32 test frames (224x224, color gradient from red to blue) **Output:** ``` Top-5 predictions: #1: class 169 (score: 4.5807) #2: class 346 (score: 4.2157) #3: class 84 (score: 3.3206) #4: class 107 (score: 3.2053) #5: class 245 (score: 2.5995) ``` | Metric | Value | |--------|-------| | Tubelets | 3,136 patches | | Sequence Length | 3,137 (+ CLS) | | Embedding | dim=768, L2 norm=17.0658 | | Forward Pass | ~726s (12 layers x 12 heads, 3137x3137 attention) | | Binary Load | 30ms (from .qora-vision) | | Safetensors Load | ~5s (from safetensors) | | Model Memory | 60 MB (Q4) | | Binary Save | ~70ms | | Result | PASS (valid predictions with correct logit distribution) | ### Performance Notes The forward pass time (~726s) is due to the large sequence length (3,137 tokens). Each attention layer computes a 3,137 x 3,137 attention matrix across 12 heads. This is expected for CPU-only inference of a video model — GPU acceleration would dramatically improve this. | Component | Time | |-----------|------| | Tubelet Embedding | ~0.1s | | Attention (per layer) | ~60s (3137x3137 matrix) | | 12 Layers Total | ~726s | | Final Classifier | <1s | ## Kinetics-400 Classes The model classifies videos into 400 human action categories including: *Sports:* basketball, golf, swimming, skateboarding, skiing, surfing, tennis, volleyball... *Daily activities:* cooking, eating, drinking, brushing teeth, washing dishes... *Music:* playing guitar, piano, drums, violin, saxophone... *Dance:* ballet, breakdancing, salsa, tap dancing... *Other:* driving car, riding horse, flying kite, blowing candles... Full class list: [Kinetics-400 Labels](https://github.com/deepmind/kinetics-i3d/blob/master/data/label_map.txt) ## QORA Model Family | Engine | Model | Params | Size (Q4) | Purpose | |--------|-------|--------|-----------|---------| | **QORA** | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat | | **QORA-TTS** | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis | | **QORA-Vision (Image)** | SigLIP 2 Base | 93M | 58 MB | Image embeddings, zero-shot classification | | **QORA-Vision (Video)** | ViViT Base | 89M | 60 MB | Video action classification | --- *Built with QORA - Pure Rust AI Inference*