QORA-Vision (Video) - Native Rust Video Classifier

Pure Rust video action classification engine based on ViViT. Classifies video clips into 400 action categories from Kinetics-400. No Python runtime, no CUDA, no external dependencies.

Overview

Property Value
Engine QORA-Vision (Pure Rust)
Base Model ViViT-B/16x2 (google/vivit-b-16x2-kinetics400)
Parameters ~89M
Quantization Q4 (4-bit symmetric, group_size=32)
Model Size 60 MB (Q4 binary)
Executable 4.4 MB
Input 32 frames x 224x224 RGB video
Output 768-dim embeddings + 400-class logits
Classes 400 (Kinetics-400 action categories)
Platform Windows x86_64 (CPU-only)

Architecture

ViViT-B/16x2 (Video Vision Transformer)

Component Details
Backbone 12-layer ViT-Base transformer
Hidden Size 768
Attention Heads 12 (head_dim=64)
MLP (Intermediate) 3,072 (GELU-Tanh activation)
Tubelet Size [2, 16, 16] (temporal, height, width)
Input Frames 32
Patches per Frame 14 x 14 = 196
Total Tubelets 16 x 14 x 14 = 3,136
Sequence Length 3,137 (3,136 tubelets + 1 CLS token)
Normalization LayerNorm with bias (eps=1e-6)
Attention Bidirectional (no causal mask)
Position Encoding Learned [3137, 768]
Classifier Linear(768, 400)

Key Design: Tubelet Embedding

Unlike image ViTs that use 2D patches, ViViT uses 3D tubelets β€” spatiotemporal volumes that capture both spatial and temporal information:

Video [3, 32, 224, 224] (C, T, H, W)
  β†’ Extract tubelets [3, 2, 16, 16] = 1,536 values each
  β†’ 16 temporal Γ— 14 height Γ— 14 width = 3,136 tubelets
  β†’ GEMM: [3136, 1536] Γ— [1536, 768] β†’ [3136, 768]
  β†’ Prepend CLS token β†’ [3137, 768]

Pipeline

Video (32 frames Γ— 224Γ—224)
    β†’ Tubelet Embedding (3D Conv: [2,16,16])
    β†’ 3,136 tubelets + CLS token = 3,137 sequence
    β†’ Add Position Embeddings [3137, 768]
    β†’ 12x ViT Transformer Layers (bidirectional)
    β†’ Final LayerNorm
    β†’ CLS token β†’ Linear(768, 400)
    β†’ Kinetics-400 logits

Files

vivit-model/
  qora-vision.exe      - 4.4 MB    Inference engine
  model.qora-vision    - 60 MB     Video model (Q4)
  config.json          - 293 B     QORA-branded config
  README.md            - This file

Usage

# Classify from frame directory (fast, from binary)
qora-vision.exe vivit --load model.qora-vision --frames ./my_frames/

# Classify from video file (requires ffmpeg)
qora-vision.exe vivit --load model.qora-vision --video clip.mp4

# Load from safetensors (slow, first time)
qora-vision.exe vivit --frames ./my_frames/ --model-path ../ViViT/

# Save binary for fast loading
qora-vision.exe vivit --model-path ../ViViT/ --save model.qora-vision

CLI Arguments

Flag Default Description
--model-path <path> . Path to model directory (safetensors)
--frames <dir> - Directory of 32 JPEG/PNG frames
--video <file> - Video file (extracts frames via ffmpeg)
--load <path> - Load binary (.qora-vision)
--save <path> - Save binary
--f16 off Use F16 weights instead of Q4

Input Requirements

  • 32 frames at 224x224 resolution
  • Frames are uniformly sampled from the video
  • Each frame: resize shortest edge to 224, center crop
  • Normalize: (pixel/255 - 0.5) / 0.5 = range [-1, 1]

Published Benchmarks

ViViT (Original Paper - ICCV 2021)

Model Variant Kinetics-400 Top-1 Top-5 Views
ViViT-B/16x2 (Factorised) 79.3% 93.4% 1x3
ViViT-L/16x2 (Factorised) 81.7% 93.8% 1x3
ViViT-H/14x2 (JFT pretrained) 84.9% 95.8% 4x3

Comparison with Other Video Models

Model Params Kinetics-400 Top-1 Architecture
QORA-Vision (ViViT-B/16x2) 89M 79.3% Video ViT (tubelets)
TimeSformer-B 121M 78.0% Divided attention
Video Swin-T 28M 78.8% 3D shifted windows
SlowFast R101-8x8 53M 77.6% Two-stream CNN
X3D-XXL 20M 80.4% Efficient 3D CNN

Test Results

Test: 32 Synthetic Frames (Color Gradient)

Input: 32 test frames (224x224, color gradient from red to blue)

Output:

Top-5 predictions:
  #1: class 169 (score: 4.5807)
  #2: class 346 (score: 4.2157)
  #3: class 84  (score: 3.3206)
  #4: class 107 (score: 3.2053)
  #5: class 245 (score: 2.5995)
Metric Value
Tubelets 3,136 patches
Sequence Length 3,137 (+ CLS)
Embedding dim=768, L2 norm=17.0658
Forward Pass ~726s (12 layers x 12 heads, 3137x3137 attention)
Binary Load 30ms (from .qora-vision)
Safetensors Load ~5s (from safetensors)
Model Memory 60 MB (Q4)
Binary Save ~70ms
Result PASS (valid predictions with correct logit distribution)

Performance Notes

The forward pass time (~726s) is due to the large sequence length (3,137 tokens). Each attention layer computes a 3,137 x 3,137 attention matrix across 12 heads. This is expected for CPU-only inference of a video model β€” GPU acceleration would dramatically improve this.

Component Time
Tubelet Embedding ~0.1s
Attention (per layer) ~60s (3137x3137 matrix)
12 Layers Total ~726s
Final Classifier <1s

Kinetics-400 Classes

The model classifies videos into 400 human action categories including:

Sports: basketball, golf, swimming, skateboarding, skiing, surfing, tennis, volleyball... Daily activities: cooking, eating, drinking, brushing teeth, washing dishes... Music: playing guitar, piano, drums, violin, saxophone... Dance: ballet, breakdancing, salsa, tap dancing... Other: driving car, riding horse, flying kite, blowing candles...

Full class list: Kinetics-400 Labels

QORA Model Family

Engine Model Params Size (Q4) Purpose
QORA SmolLM3-3B 3.07B 1.68 GB Text generation, reasoning, chat
QORA-TTS Qwen3-TTS 1.84B 1.5 GB Text-to-speech synthesis
QORA-Vision (Image) SigLIP 2 Base 93M 58 MB Image embeddings, zero-shot classification
QORA-Vision (Video) ViViT Base 89M 60 MB Video action classification

Built with QORA - Pure Rust AI Inference

Downloads last month
38
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for qoranet/QORA-Vision-Video

Finetuned
(77)
this model

Evaluation results