QORA-Vision (Video) - Native Rust Video Classifier

Pure Rust video action classification engine based on ViViT. Classifies video clips into 400 action categories from Kinetics-400. No Python runtime, no CUDA, no external dependencies.

Overview

Property	Value
Engine	QORA-Vision (Pure Rust)
Base Model	ViViT-B/16x2 (google/vivit-b-16x2-kinetics400)
Parameters	~89M
Quantization	Q4 (4-bit symmetric, group_size=32)
Model Size	60 MB (Q4 binary)
Executable	4.4 MB
Input	32 frames x 224x224 RGB video
Output	768-dim embeddings + 400-class logits
Classes	400 (Kinetics-400 action categories)
Platform	Windows x86_64 (CPU-only)

Architecture

ViViT-B/16x2 (Video Vision Transformer)

Component	Details
Backbone	12-layer ViT-Base transformer
Hidden Size	768
Attention Heads	12 (head_dim=64)
MLP (Intermediate)	3,072 (GELU-Tanh activation)
Tubelet Size	[2, 16, 16] (temporal, height, width)
Input Frames	32
Patches per Frame	14 x 14 = 196
Total Tubelets	16 x 14 x 14 = 3,136
Sequence Length	3,137 (3,136 tubelets + 1 CLS token)
Normalization	LayerNorm with bias (eps=1e-6)
Attention	Bidirectional (no causal mask)
Position Encoding	Learned [3137, 768]
Classifier	Linear(768, 400)

Key Design: Tubelet Embedding

Unlike image ViTs that use 2D patches, ViViT uses 3D tubelets — spatiotemporal volumes that capture both spatial and temporal information:

Video [3, 32, 224, 224] (C, T, H, W)
  → Extract tubelets [3, 2, 16, 16] = 1,536 values each
  → 16 temporal × 14 height × 14 width = 3,136 tubelets
  → GEMM: [3136, 1536] × [1536, 768] → [3136, 768]
  → Prepend CLS token → [3137, 768]

Pipeline

Video (32 frames × 224×224)
    → Tubelet Embedding (3D Conv: [2,16,16])
    → 3,136 tubelets + CLS token = 3,137 sequence
    → Add Position Embeddings [3137, 768]
    → 12x ViT Transformer Layers (bidirectional)
    → Final LayerNorm
    → CLS token → Linear(768, 400)
    → Kinetics-400 logits

Files

vivit-model/
  qora-vision.exe      - 4.4 MB    Inference engine
  model.qora-vision    - 60 MB     Video model (Q4)
  config.json          - 293 B     QORA-branded config
  README.md            - This file

Usage

# Classify from frame directory (fast, from binary)
qora-vision.exe vivit --load model.qora-vision --frames ./my_frames/

# Classify from video file (requires ffmpeg)
qora-vision.exe vivit --load model.qora-vision --video clip.mp4

# Load from safetensors (slow, first time)
qora-vision.exe vivit --frames ./my_frames/ --model-path ../ViViT/

# Save binary for fast loading
qora-vision.exe vivit --model-path ../ViViT/ --save model.qora-vision

CLI Arguments

Flag	Default	Description
`--model-path <path>`	`.`	Path to model directory (safetensors)
`--frames <dir>`	-	Directory of 32 JPEG/PNG frames
`--video <file>`	-	Video file (extracts frames via ffmpeg)
`--load <path>`	-	Load binary (.qora-vision)
`--save <path>`	-	Save binary
`--f16`	off	Use F16 weights instead of Q4

Input Requirements

32 frames at 224x224 resolution
Frames are uniformly sampled from the video
Each frame: resize shortest edge to 224, center crop
Normalize: (pixel/255 - 0.5) / 0.5 = range [-1, 1]

Published Benchmarks

ViViT (Original Paper - ICCV 2021)

Model Variant	Kinetics-400 Top-1	Top-5	Views
ViViT-B/16x2 (Factorised)	79.3%	93.4%	1x3
ViViT-L/16x2 (Factorised)	81.7%	93.8%	1x3
ViViT-H/14x2 (JFT pretrained)	84.9%	95.8%	4x3

Comparison with Other Video Models

Model	Params	Kinetics-400 Top-1	Architecture
QORA-Vision (ViViT-B/16x2)	89M	79.3%	Video ViT (tubelets)
TimeSformer-B	121M	78.0%	Divided attention
Video Swin-T	28M	78.8%	3D shifted windows
SlowFast R101-8x8	53M	77.6%	Two-stream CNN
X3D-XXL	20M	80.4%	Efficient 3D CNN

Test Results

Test: 32 Synthetic Frames (Color Gradient)

Input: 32 test frames (224x224, color gradient from red to blue)

Output:

Top-5 predictions:
  #1: class 169 (score: 4.5807)
  #2: class 346 (score: 4.2157)
  #3: class 84  (score: 3.3206)
  #4: class 107 (score: 3.2053)
  #5: class 245 (score: 2.5995)

Metric	Value
Tubelets	3,136 patches
Sequence Length	3,137 (+ CLS)
Embedding	dim=768, L2 norm=17.0658
Forward Pass	~726s (12 layers x 12 heads, 3137x3137 attention)
Binary Load	30ms (from .qora-vision)
Safetensors Load	~5s (from safetensors)
Model Memory	60 MB (Q4)
Binary Save	~70ms
Result	PASS (valid predictions with correct logit distribution)

Performance Notes

The forward pass time (~726s) is due to the large sequence length (3,137 tokens). Each attention layer computes a 3,137 x 3,137 attention matrix across 12 heads. This is expected for CPU-only inference of a video model — GPU acceleration would dramatically improve this.

Component	Time
Tubelet Embedding	~0.1s
Attention (per layer)	~60s (3137x3137 matrix)
12 Layers Total	~726s
Final Classifier	<1s

Kinetics-400 Classes

The model classifies videos into 400 human action categories including:

Sports: basketball, golf, swimming, skateboarding, skiing, surfing, tennis, volleyball... Daily activities: cooking, eating, drinking, brushing teeth, washing dishes... Music: playing guitar, piano, drums, violin, saxophone... Dance: ballet, breakdancing, salsa, tap dancing... Other: driving car, riding horse, flying kite, blowing candles...

Full class list: Kinetics-400 Labels

QORA Model Family

Engine	Model	Params	Size (Q4)	Purpose
QORA	SmolLM3-3B	3.07B	1.68 GB	Text generation, reasoning, chat
QORA-TTS	Qwen3-TTS	1.84B	1.5 GB	Text-to-speech synthesis
QORA-Vision (Image)	SigLIP 2 Base	93M	58 MB	Image embeddings, zero-shot classification
QORA-Vision (Video)	ViViT Base	89M	60 MB	Video action classification

Built with QORA - Pure Rust AI Inference

Downloads last month: 4

Inference Providers NEW

Video Classification

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for qoranet/QORA-Vision-Video

Base model

google/vivit-b-16x2-kinetics400

Finetuned

(77)

this model

Evaluation results

Top-1 Accuracy on Kinetics-400
self-reported

79.300